Updated pipeline for small molecule data processing at PDBe

We have updated the tools used at PDBe and PDBe-KB for processing small molecule ligands and their interactions in the PDB archive. The pipeline now identifies larger ligands composed of individual components that are covalently linked together, allowing these molecules to be identified in a standardised manner for the first time.



The PDBe software ‘pdbeccdutils’ is a Python library developed for processing small molecules in the PDB. It covers all small molecule ligands in the wwPDB Chemical Component Dictionary (CCD) and multi-component ligands in the wwPDB Biologically Interesting Molecule Reference Dictionary (BIRD). The library is used to generate the data provided on the PDBeChem pages, though much more information is also generated and made available in the updated CIF files for CCDs. These files are available on the PDBe pages or via the EMBL-EBI FTP area, at ftp.ebi.ac.uk/pub/databases/msd/updated_mmcif/.

The pdbeccdutils library builds on a number of other Python tools, and this release integrates new versions of this software alongside a number of other updates. The library includes Gemmi, a widely adopted tool that is now used for parsing CIF files in pdbeccdutils and allows better interoperability with other tools. It also includes RDKit, another widely used package, which generates molecular properties for molecules in the CCD.

The main features of pdbeccdutils include:

  • Reading and writing of CCD files
  • Generation of 2D depictions of ligands
  • Generation of 3D conformations
  • Fragment library search 
  • Lightweight implementation of parity method for identification of similar ligands
  • Identification of chemical scaffolds
  • Calculation of RDKit molecular properties
  • UniChem mapping to other databases (e.g. ChEMBL, DrugBank and more)

Updates to pdbeccdutils include the addition of a new module to determine ‘bound molecules’, defined as multiple covalently linked CCD ligands combined into a single, larger molecule. Though these have previously been defined for some cases in the BIRD, this is the first systematic process to identify and group covalently linked CCDs across the PDB archive. Other updates to pdbeccdutils include moving to RDKit 2022.09.x, replacing pdbecif with Gemmi for parsing CIF files, and replacing ccd_cif_dict with ccd_cif_block, which is a breaking change.


Image showing individual components of larger ligand Myristoyl-Coenzyme A
An example of a bound ligand in PDB entry 6lq4. In the archive file, COA (Coenzyme A, carbons coloured purple) and MYR (Myristic acid, carbons coloured brown) are identified as two separate ligands bound to the protein. However, the complete molecule Myristoyl-Coenzyme A, with a covalent link between the two (coloured magenta), is the substrate for N-myristoyltransferase. The “bm_reader” module in pdbeccdutils can be used to identify such covalently linked components as a single molecule and generate data for the whole entity, including 2D depictions, InChI and InChIKey.
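
The idea of grouping covalently linked components can be illustrated with a small graph sketch. This is not the bm_reader implementation or its API. It is a minimal, standard-library illustration that assumes residues are identified by hypothetical (chain, component ID, residue number) tuples and that the covalent links between them are already known. Each connected group of residues is then treated as one bound molecule.

```python
from collections import defaultdict


def group_bound_molecules(residues, links):
    """Group residues into connected components of the covalent-link graph.

    `residues` is a list of hypothetical (chain, component ID, residue number)
    tuples; `links` is a list of pairs of such tuples joined by a covalent bond.
    """
    # Build an undirected adjacency map; residues without links stay isolated.
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    groups = []
    for start in residues:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack, group = [start], []
        seen.add(start)
        while stack:
            current = stack.pop()
            group.append(current)
            for neighbour in adjacency[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


# Illustrative input loosely modelled on the COA/MYR example above.
# Residue numbers are made up, and GOL is a hypothetical unlinked ligand
# added only for contrast.
residues = [("A", "COA", 501), ("A", "MYR", 502), ("A", "GOL", 503)]
links = [(("A", "COA", 501), ("A", "MYR", 502))]
bound_molecules = group_bound_molecules(residues, links)
```

In this sketch COA and MYR end up in one group and the unlinked ligand in a group of its own. For real data, the pdbeccdutils documentation describes the actual module and its inputs.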


For more information about pdbeccdutils and its use, visit the documentation page at pdbeurope.github.io/ccdutils/guide/intro.html


PDBe Arpeggio

Arpeggio is a Python library developed by the Blundell group that calculates interatomic contacts in a protein, including those between ligands and amino acids. We have developed a version of Arpeggio, called PDBe Arpeggio, which uses this underlying library to generate interaction data for proteins and ligands in the PDB archive. The interatomic contacts are generated based on rules defined in CREDO, a protein-ligand interaction database. Users should be aware that PDBe Arpeggio only supports input files in PDBx/mmCIF format, the standard format for the PDB archive.

In this update to the PDBe Arpeggio library (v1.5), the output JSON files that contain interaction data now include the chemical component type. This allows easy identification of proteins, bound ligands and water molecules, using types of P, B and W, respectively. Furthermore, PDBe Arpeggio now uses Open Babel 3.0.0 and, as with pdbeccdutils, we have also updated the CIF parser in PDBe Arpeggio to use Gemmi.
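
As an illustration of how the component type might be used, the sketch below filters contact records down to protein-ligand pairs. The record layout is an assumption made for this example (a `bgn` and an `end` atom, each carrying a `label_comp_type` field), and the sample data is invented. The PDBe Arpeggio documentation is the place to check the actual output schema.

```python
import json

# Invented sample resembling an interactions JSON file.
# Field names are assumptions for this sketch, not a confirmed schema.
sample = json.loads("""
[
  {"bgn": {"label_comp_id": "COA", "label_comp_type": "B"},
   "end": {"label_comp_id": "TYR", "label_comp_type": "P"},
   "contact": ["hbond"]},
  {"bgn": {"label_comp_id": "HOH", "label_comp_type": "W"},
   "end": {"label_comp_id": "COA", "label_comp_type": "B"},
   "contact": ["polar"]},
  {"bgn": {"label_comp_id": "SER", "label_comp_type": "P"},
   "end": {"label_comp_id": "GLY", "label_comp_type": "P"},
   "contact": ["proximal"]}
]
""")


def contacts_between(records, type_a, type_b, type_key="label_comp_type"):
    """Return records whose two ends have the given pair of types, in either order."""
    wanted = {type_a, type_b}
    return [
        record
        for record in records
        if {record["bgn"][type_key], record["end"][type_key]} == wanted
    ]


# P = protein, B = bound ligand, W = water, as described above.
protein_ligand = contacts_between(sample, "P", "B")
```

Comparing sets rather than ordered pairs means a contact is matched regardless of which atom is listed first, which is a design choice of this sketch rather than a property of the Arpeggio output.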

For more information about PDBe Arpeggio and its use, visit the documentation page at github.com/PDBeurope/arpeggio.
