PDBeCIF: an open-source mmCIF/CIF parsing and processing package

Abstract. Background: Biomacromolecular structural data have outgrown the legacy Protein Data Bank (PDB) format, which the scientific community relied on for decades, yet its successor, the PDBx/Macromolecular Crystallographic Information File format (PDBx/mmCIF), is still not widely used. One likely reason is the availability of easy-to-use tools that support only the legacy format; another is the inherent difficulty of processing mmCIF files correctly, given the number of edge cases that make efficient parsing problematic. Nevertheless, to fully exploit macromolecular structure data and their associated annotations, such as multiscale structures from integrative/hybrid methods or large macromolecular complexes determined using traditional methods, it is necessary to adopt the new format as soon as possible. Results: To this end, we developed PDBeCIF, an open-source Python project for manipulating mmCIF and CIF files. It is part of the official list of mmCIF parsers recorded by the wwPDB and is heavily used in the processes of the Protein Data Bank in Europe. The package is freely available both from the PyPI repository (http://pypi.org/project/pdbecif) and from GitHub (https://github.com/pdbeurope/pdbecif), along with rich documentation and many ready-to-use examples. Conclusions: PDBeCIF is an efficient and lightweight Python 2.6+/3+ package with no external dependencies. It can be readily integrated with third-party libraries and adopted for broad scientific analyses.
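
One of the parsing edge cases mentioned in the abstract can be illustrated with a small sketch. In CIF syntax a quoted value ends only at a quote character that is followed by whitespace, so a naive split on spaces or quotes breaks values such as atom names containing a prime. The function below is a hypothetical helper written for this illustration, not PDBeCIF's implementation, and it deliberately ignores multi-line semicolon-delimited text fields and loop_ bookkeeping.

```python
def tokenize_cif_line(line):
    """Split one mmCIF line into tokens, honouring CIF quoting rules.

    Illustrative only: multi-line ';' text fields are not handled.
    """
    tokens = []
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if ch.isspace():
            i += 1
            continue
        if ch == "#":
            break  # a '#' at the start of a token begins a comment
        if ch in "'\"":
            # The closing quote must be followed by whitespace or end of line,
            # so an embedded quote (as in O5') does not terminate the value.
            j = i + 1
            while j < n:
                if line[j] == ch and (j + 1 == n or line[j + 1].isspace()):
                    break
                j += 1
            tokens.append(line[i + 1:j])
            i = j + 1
        else:
            # Unquoted value: runs until the next whitespace character.
            j = i
            while j < n and not line[j].isspace():
                j += 1
            tokens.append(line[i:j])
            i = j
    return tokens


print(tokenize_cif_line("ATOM 1 N \"O5'\" 'a b' # trailing comment"))
```

A production parser would additionally track data blocks, categories, and looped rows; the sketch only shows why character-level scanning is needed for correct tokenization.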

pdb-tools: a swiss army knife for molecular structures

F1000Research ◽

10.12688/f1000research.17456.1 ◽

2018 ◽

Vol 7 ◽

pp. 1961 ◽

Cited By ~ 6

Author(s):

João P. G. L. M. Rodrigues ◽

João M.C. Teixeira ◽

Mikaël Trellet ◽

Alexandre M. J. J. Bonvin

Keyword(s):

Molecular Structure ◽

Open Source ◽

Protein Data Bank ◽

Structure Data ◽

Data Bank ◽

Molecular Structures ◽

Command Line ◽

Efficient Manner

The pdb-tools are a collection of Python scripts for working with molecular structure data in the Protein Data Bank (PDB) format. They allow users to edit, convert, and validate PDB files, from the command-line, in a simple but efficient manner. The pdb-tools are implemented in Python, without any external dependencies, and are freely available under the open-source Apache License at https://github.com/haddocking/pdb-tools/ and on PyPI.
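
To make the idea of command-line editing of PDB files concrete, here is a minimal sketch of how a coordinate record can be read from the fixed-column legacy PDB layout and how records can be filtered by chain. The function names are hypothetical and are not taken from pdb-tools; the column positions follow the standard ATOM/HETATM record layout, and the sketch assumes plain decimal serial and residue numbers.

```python
def parse_atom_line(line):
    """Parse one ATOM/HETATM record using the fixed PDB column layout."""
    return {
        "record": line[0:6].strip(),   # columns 1-6
        "serial": int(line[6:11]),     # columns 7-11
        "name": line[12:16].strip(),   # columns 13-16
        "resname": line[17:20].strip(),  # columns 18-20
        "chain": line[21],             # column 22
        "resseq": int(line[22:26]),    # columns 23-26
        "x": float(line[30:38]),       # columns 31-38
        "y": float(line[38:46]),       # columns 39-46
        "z": float(line[46:54]),       # columns 47-54
    }


def select_chain(lines, chain_id):
    """Yield only the coordinate records that belong to chain_id."""
    for line in lines:
        is_coord = line.startswith(("ATOM", "HETATM"))
        if is_coord and len(line) > 21 and line[21] == chain_id:
            yield line


example = "ATOM      1  N   MET A   1      27.340  24.430   2.614"
atom = parse_atom_line(example)
print(atom["name"], atom["chain"], atom["resseq"], atom["x"])
```

Because every field sits at a fixed column, slicing is enough and no tokenizer is required, which is part of why tools for the legacy format stay so small.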

PDBrenum: a webserver and program providing Protein Data Bank files renumbered according to their UniProt sequences

10.1101/2021.02.14.431128 ◽

2021 ◽

Author(s):

Bulat Faezov ◽

Roland L. Dunbrack

Keyword(s):

Protein Data Bank ◽

Data Bank ◽

Post Translational Modifications ◽

X Ray ◽

X Ray Crystallography ◽

Binding Partners ◽

Cryo Electron Microscopy ◽

Comparative Structure ◽

In The Beginning

Abstract. The Protein Data Bank (PDB) was established at Brookhaven National Laboratory in 1971 as an archive for biological macromolecular crystal structures. In the beginning the archive held only seven structures, but by early 2021 the database held more than 170,000 structures solved by X-ray crystallography, nuclear magnetic resonance, cryo-electron microscopy, and other methods. Many proteins have been studied under different conditions (e.g., binding partners such as ligands, nucleic acids, or other proteins; mutations and post-translational modifications), thus enabling comparative structure-function studies. However, these studies are made more difficult because the PDB allows authors to number the amino acids in each protein sequence in any manner they wish. As a result, the same protein is numbered differently across the available PDB entries. In addition to the coordinates, there are many fields that contain information regarding specific residues in the sequence of each protein in the entry. Here we provide a webserver and Python3 application that fixes the PDB sequence numbering problem by replacing the author numbering with numbering derived from the corresponding UniProt sequences. We obtain this correspondence from the SIFTS database from PDBe. The server and program can take a list of PDB entries and provide renumbered files in mmCIF format and the legacy PDB format for both asymmetric unit files and biological assembly files provided by PDBe. The server can also take a list of UniProt identifiers (“P04637” or “P53_HUMAN”) and return the desired files. Availability: Source code is freely available at https://github.com/Faezov/PDBrenum. The webserver is located at: http://dunbrack3.fccc.edu/[email protected] or [email protected].
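
The renumbering idea can be sketched in a few lines, assuming a residue-level correspondence between author numbering and UniProt numbering is already available (in the abstract this comes from SIFTS). The data layout, the function name, and the policy of leaving unmapped residues unchanged are assumptions made for this illustration; they are not taken from the PDBrenum source code.

```python
def renumber_residues(residues, mapping):
    """Replace author residue numbers with UniProt-derived numbers.

    residues: list of (chain, author_number, residue_name) tuples.
    mapping:  dict {(chain, author_number): uniprot_number}, a hypothetical
              stand-in for a SIFTS-derived correspondence table.
    Returns a list of (chain, new_number, residue_name, was_mapped) tuples.
    Unmapped residues keep their author number here; this is an assumed
    policy for the sketch, and a real tool must avoid number collisions.
    """
    renumbered = []
    for chain, author_number, residue_name in residues:
        uniprot_number = mapping.get((chain, author_number))
        was_mapped = uniprot_number is not None
        new_number = uniprot_number if was_mapped else author_number
        renumbered.append((chain, new_number, residue_name, was_mapped))
    return renumbered


# Toy data: author numbering starts at 1, UniProt numbering at 94.
residues = [("A", 1, "SER"), ("A", 2, "VAL"), ("A", 3, "HOH")]
mapping = {("A", 1): 94, ("A", 2): 95}
for row in renumber_residues(residues, mapping):
    print(row)
```

In real entries the same substitution also has to be applied to every non-coordinate field that refers to residue numbers, which is the part of the problem the abstract highlights.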

The Popgen Pipeline Platform: A Software Platform for Facilitating Population Genomic Analyses

10.1101/785774 ◽

2019 ◽

Author(s):

Andrew Webb ◽

Jared Knoblauch ◽

Nitesh Sabankar ◽

Apeksha Sukesh Kallur ◽

Jody Hey ◽

...

Keyword(s):

Open Source ◽

Development Time ◽

End Users ◽

File Format ◽

Software Platform ◽

Format Conversion ◽

Population Genomic ◽

Genomic Analyses ◽

File Format Conversion

Abstract. Here we present the Pop-Gen Pipeline Platform (PPP), a software platform with the goal of reducing the computational expertise required for conducting population genomic analyses. The PPP was designed as a collection of scripts that facilitate common population genomic workflows in a consistent and standardized Python environment. Functions were developed to encompass entire workflows, including input preparation, file format conversion, various population genomic analyses, output generation, and visualization. By facilitating entire workflows, the PPP offers several benefits to prospective end users: it reduces the need for redundant in-house software and scripts that would require development time and may be error-prone or incorrect. The platform has also been developed with reproducibility and extensibility of analyses in mind. The PPP is an open-source package that is available for download and use at https://ppp.readthedocs.io/en/latest/PPP_pages/install.html

Analyzing Motion Properties of Proteins Affected by Localized Structures From a Robot Kinematics Perspective

Volume 5A: 39th Mechanisms and Robotics Conference ◽

10.1115/detc2015-47010 ◽

2015 ◽

Author(s):

Keisuke Arikawa

Keyword(s):

Protein Data Bank ◽

Complex Shape ◽

Structural Data ◽

Data Bank ◽

Robot Kinematics ◽

Motion Prediction ◽

Serial Manipulators ◽

Localized Structures ◽

Motion Modes ◽

Structural Compliance

On the basis of robot kinematics, we have thus far developed a method for predicting the motion of proteins from their 3D structural data given in the Protein Data Bank (PDB data). In this method, proteins are modeled as serial manipulators constrained by springs and the structural compliance properties of the models are evaluated. We focus on localized instead of whole structures of proteins. Employing the same model used in our method of motion prediction, the motion properties of the localized structures and the relation between the motion properties of localized and whole structures are analyzed. First, we present a method for graphically expressing the deformation of objects with a complex shape, such as proteins, by approximating the shape as a rectangular prism with a mesh on its surface. We then formulate a method for comparing the motion properties of localized structures cleaved from the whole structure and those remaining in it by expressing the motion of the latter using the decomposed motion modes of the former according to the structural compliance. Finally, we show a method for evaluating the effect of a localized structure on the motion properties of proteins by applying forces to localized structures. In the formulations, we demonstrate applications as illustrative examples using the PDB data of a real protein.

Protein Data Bank Japan (PDBj): maintaining a structural data archive and resource description framework format

Nucleic Acids Research ◽

10.1093/nar/gkr811 ◽

2011 ◽

Vol 40 (D1) ◽

pp. D453-D460 ◽

Cited By ~ 88

Author(s):

A. R. Kinjo ◽

H. Suzuki ◽

R. Yamashita ◽

Y. Ikegawa ◽

T. Kudou ◽

...

Keyword(s):

Protein Data Bank ◽

Resource Description Framework ◽

Structural Data ◽

Data Bank ◽

Data Archive ◽

Description Framework ◽

Resource Description

A chemical interpretation of protein electron density maps in the worldwide protein data bank

10.1101/613109 ◽

2019 ◽

Cited By ~ 3

Author(s):

Sen Yao ◽

Hunter N.B. Moseley

Keyword(s):

Protein Data Bank ◽

Electron Density ◽

Structural Model ◽

Data Bank ◽

X Ray ◽

New Methods ◽

Density Maps ◽

Structure Factors ◽

Python Package

Abstract. High-quality three-dimensional structural data is of great value for the functional interpretation of biomacromolecules, especially proteins; however, structural quality varies greatly across the entries in the worldwide Protein Data Bank (wwPDB). Since 2008, the wwPDB has required the inclusion of structure factors with the deposition of x-ray crystallographic structures to support the independent evaluation of structures with respect to the underlying experimental data used to derive those structures. However, interpreting the discrepancies between the structural model and its underlying electron density data is difficult, since derived electron density maps use arbitrary electron density units which are inconsistent between maps from different wwPDB entries. Therefore, we have developed a method that converts electron density values into units of electrons. With this conversion, we have developed new methods that can evaluate specific regions of an x-ray crystallographic structure with respect to a physicochemical interpretation of its corresponding electron density map. We have systematically compared all deposited x-ray crystallographic protein models in the wwPDB with their underlying electron density maps, if available, and characterized the electron density in terms of expected numbers of electrons based on the structural model. The methods generated coherent evaluation metrics throughout all PDB entries with associated electron density data, which are consistent with visualization software that would normally be used for manual quality assessment. To our knowledge, this is the first attempt to derive units of electrons directly from electron density maps without the aid of the underlying structure factors. These new metrics are biochemically informative and can be extremely useful for filtering out low-quality structural regions from inclusion in systematic analyses that span large numbers of PDB entries. Furthermore, these new metrics will improve the ability of non-crystallographers to evaluate regions of interest within PDB entries, since only the PDB structure and the associated electron density maps are needed. These new methods are available as a well-documented Python package on GitHub and the Python Package Index under a modified Clear BSD open source license.

Author summary. Electron density maps are very useful for validating the x-ray structure models in the Protein Data Bank (PDB). However, it is often daunting for non-crystallographers to use electron density maps, as it requires a lot of prior knowledge. This study provides methods that can infer chemical information solely from the electron density maps available from the PDB to interpret the electron density and electron density discrepancy values in terms of units of electrons. It also provides methods to evaluate regions of interest in terms of the number of missing or excess electrons, so that a broader audience, such as biologists or bioinformaticians, can also make better use of the electron density information available in the PDB, especially for quality control purposes.

Software and full results are available at:
https://github.com/MoseleyBioinformaticsLab/pdb_eda (software on GitHub)
https://pypi.org/project/pdb-eda/ (software on PyPI)
https://pdb-eda.readthedocs.io/en/latest/ (documentation on ReadTheDocs)
https://doi.org/10.6084/m9.figshare.7994294 (code and results on FigShare)

Pairs and Pairix: a file format and a tool for efficient storage and retrieval for Hi-C read pairs

10.1101/2021.08.24.457552 ◽

2021 ◽

Author(s):

Soohyun Lee ◽

Carl Vitzthum ◽

Burak H. Alver ◽

Peter J. Park

Keyword(s):

Quality Control ◽

Open Source ◽

Source Code ◽

Three Dimensional ◽

File Format ◽

Interaction Data ◽

Text File ◽

Storage And Retrieval ◽

Efficient Storage

Abstract. Summary: As the amount of three-dimensional chromosomal interaction data continues to increase, storing and accessing such data efficiently becomes paramount. We introduce Pairs, a block-compressed text file format for storing paired genomic coordinates from Hi-C data, and Pairix, an open-source C application to index and query Pairs files. Pairix (also available in Python and R) extends the functionality of Tabix to paired-coordinate data. We have also developed PairsQC, a collapsible HTML quality control report generator for Pairs files. Availability: The format specification and source code are available at https://github.com/4dn-dcic/pairix, https://github.com/4dn-dcic/Rpairix and https://github.com/4dn-dcic/[email protected] or [email protected]
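
The kind of query that paired-coordinate indexing serves can be sketched with an in-memory stand-in. The code below groups read pairs by chromosome pair, sorts them, and uses binary search on the first coordinate before filtering on the second. It is a simplified illustration only: it does not read the Pairs file format, does not use block compression, and is not the Pairix or Tabix indexing scheme; all names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_index(pairs):
    """Group (chr1, pos1, chr2, pos2) records by chromosome pair and sort them."""
    index = defaultdict(list)
    for chr1, pos1, chr2, pos2 in pairs:
        index[(chr1, chr2)].append((pos1, pos2))
    for block in index.values():
        block.sort()
    return index


def query(index, chr1, start1, end1, chr2, start2, end2):
    """Return pairs whose first mate lies in chr1:start1-end1 and whose
    second mate lies in chr2:start2-end2 (inclusive bounds)."""
    block = index.get((chr1, chr2), [])
    # Binary search narrows the candidates on the first coordinate.
    lo = bisect_left(block, (start1, float("-inf")))
    hi = bisect_right(block, (end1, float("inf")))
    # The second coordinate is checked by a linear scan over the candidates.
    return [(p1, p2) for p1, p2 in block[lo:hi] if start2 <= p2 <= end2]


pairs = [
    ("chr1", 100, "chr1", 5000),
    ("chr1", 250, "chr2", 700),
    ("chr1", 300, "chr1", 900),
    ("chr1", 900, "chr1", 950),
]
index = build_index(pairs)
print(query(index, "chr1", 200, 1000, "chr1", 800, 1000))
```

An on-disk index follows the same principle of narrowing by one dimension first, but works on compressed blocks so that only the relevant parts of a large file are read.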

GeoMine: interactive pattern mining of protein–ligand interfaces in the Protein Data Bank

Bioinformatics ◽

10.1093/bioinformatics/btaa693 ◽

2020 ◽

Author(s):

Konrad Diedrich ◽

Joel Graef ◽

Katrin Schöning-Stierand ◽

Matthias Rarey

Keyword(s):

Protein Data Bank ◽

Web Application ◽

Pattern Mining ◽

Structural Data ◽

Data Bank ◽

Supplementary Information ◽

User Friendliness ◽

Iterative Search ◽

Potential Applications ◽

Query Generation

Abstract. Summary: Searching for user-defined 3D queries in molecular interfaces is a computationally challenging problem that has not yet been satisfactorily solved. Most of the few existing tools built for that purpose are desktop-based and not openly available. They also lack query versatility, search efficiency and user-friendliness. We address this issue with GeoMine, a publicly available web application that provides textual, numerical and geometrical search functionality for protein–ligand binding sites derived from structural data contained in the Protein Data Bank (PDB). Query generation is supported by a 3D representation of a start structure that provides interactively selectable elements such as atoms, bonds and interactions. GeoMine gives full control over geometric variability in the query while performing a deterministic, precise search. Reasonably selective queries are processed on the entire set of protein–ligand complexes in the PDB within a few minutes. GeoMine offers an interactive and iterative search process of successive result analyses and query adaptations. From the numerous potential applications, we picked two from the field of side-effect analysis to showcase the usefulness of GeoMine. Availability and implementation: GeoMine is part of the ProteinsPlus web application suite and freely available at https://proteins.plus. Supplementary information: Supplementary data are available at Bioinformatics online.

Crystallography Open Database – an open-access collection of crystal structures

Journal of Applied Crystallography ◽

10.1107/s0021889809016690 ◽

2009 ◽

Vol 42 (4) ◽

pp. 726-729 ◽

Cited By ~ 588

Author(s):

Saulius Gražulis ◽

Daniel Chateigner ◽

Robert T. Downs ◽

A. F. T. Yokochi ◽

Miguel Quirós ◽

...

Keyword(s):

Open Access ◽

Crystal Structures ◽

Organic Molecule ◽

Structural Data ◽

File Format ◽

International Union ◽

Information File ◽

Metal Organic ◽

Crystallography Open Database ◽

Access Model

The Crystallography Open Database (COD), which is a project that aims to gather all available inorganic, metal–organic and small organic molecule structural data in one database, is described. The database adopts an open-access model. The COD currently contains ∼80 000 entries in crystallographic information file format, with nearly full coverage of the International Union of Crystallography publications, and is growing in size and quality.

PDBMine: A Reformulation of the Protein Data Bank to Facilitate Structural Data Mining

2019 International Conference on Computational Science and Computational Intelligence (CSCI) ◽

10.1109/csci49370.2019.00272 ◽

2019 ◽

Cited By ~ 1

Author(s):

Casey Cole ◽

Christopher Ott ◽

Diego Valdes ◽

Homayoun Valafar

Keyword(s):

Data Mining ◽

Protein Data Bank ◽

Structural Data ◽

Data Bank
