Macromolecular Structure Databases: Past Progress and Future Challenges

Databases containing macromolecular structure data provide a crystallographer with important tools for use in solving, refining and understanding the functional significance of their protein structures. Given this importance, this paper briefly summarizes past progress by outlining the features of the significant number of relevant databases developed to date. One recent database, PDB+, containing all current and obsolete structures deposited with the Protein Data Bank (PDB) is discussed in more detail. PDB+ has been used to analyze the self-consistency of the current (1 January 1998) corpus of over 7000 structures. A summary of those findings is presented (a full discussion will appear elsewhere) in the form of global and temporal trends within the data. These trends indicate that challenges exist if crystallographers are to provide the community with complete and consistent structural results in the future. It is argued that better information management practices are required to meet these challenges.

Protein Data Bank: the single global archive for 3D macromolecular structure data

Nucleic Acids Research ◽

10.1093/nar/gky949 ◽

2018 ◽

Vol 47 (D1) ◽

pp. D520-D528 ◽

Cited By ~ 135

Author(s):

Stephen K Burley ◽

Helen M Berman ◽

Charmi Bhikadiya ◽

Chunxiao Bi ◽

...

Keyword(s):

Protein Data Bank ◽

Structure Data ◽

Data Bank ◽

Macromolecular Structure

Enriched Conformational Sampling of DNA and Proteins with a Hybrid Hamiltonian Derived from the Protein Data Bank

International Journal of Molecular Sciences ◽

10.3390/ijms19113405 ◽

2018 ◽

Vol 19 (11) ◽

pp. 3405 ◽

Cited By ~ 3

Author(s):

Emanuel Peter ◽

Jiří Černý

Keyword(s):

Partition Function ◽

Protein Data Bank ◽

Protein Structures ◽

Data Bank ◽

Weighting Factor ◽

Potential Of Mean Force ◽

Conformational Space ◽

Dynamics Simulation ◽

Conformational Sampling ◽

Speed Increase

In this article, we present a method for the enhanced molecular dynamics simulation of protein and DNA systems called potential of mean force (PMF)-enriched sampling. The method uses partitions derived from the potentials of mean force, which we determined from DNA and protein structures in the Protein Data Bank (PDB). We define a partition function from a set of PDB-derived PMFs, which efficiently compensates for the error introduced by the assumption of a homogeneous partition function from the PDB datasets. The bias based on the PDB-derived partitions is added in the form of a hybrid Hamiltonian using a renormalization method, which adds the PMF-enriched gradient to the system depending on a linear weighting factor and the underlying force field. We validated the method using simulations of dialanine, the folding of TrpCage, and the conformational sampling of the Dickerson–Drew DNA dodecamer. Our results show the potential for the PMF-enriched simulation technique to enrich the conformational space of biomolecules along their order parameters, while we also observe a considerable speed increase in the sampling by factors ranging from 13.1 to 82. The novel method can effectively be combined with enhanced sampling or coarse-graining methods to enrich conformational sampling with a partition derived from the PDB.
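The idea of deriving a potential of mean force from structural statistics and adding its gradient with a linear weight can be illustrated with a minimal sketch. It assumes the standard inverse-Boltzmann relation PMF(x) = -kT ln p(x) on a one-dimensional histogram; the function names, the pseudocount, the units and the simple additive mixing are assumptions made for illustration and are not the renormalization scheme of the paper.

```python
import math

KT = 0.593  # kcal/mol near 298 K (assumed units for this sketch)

def pmf_from_counts(counts, kT=KT):
    """Inverse-Boltzmann PMF per bin, -kT * ln(p), shifted so the minimum is 0."""
    total = sum(counts)
    # A pseudocount of 1 per bin keeps empty bins finite (assumption of this sketch).
    probs = [(c + 1) / (total + len(counts)) for c in counts]
    pmf = [-kT * math.log(p) for p in probs]
    lowest = min(pmf)
    return [value - lowest for value in pmf]

def pmf_force(pmf, bin_width, index):
    """Force along the coordinate, -dPMF/dx, by finite differences on the bins."""
    lo = max(index - 1, 0)
    hi = min(index + 1, len(pmf) - 1)
    if hi == lo:  # a single-bin histogram has no gradient
        return 0.0
    return -(pmf[hi] - pmf[lo]) / ((hi - lo) * bin_width)

def hybrid_force(force_field_force, pmf, bin_width, index, weight):
    """Force-field force plus a linearly weighted PMF-derived force (hypothetical mixing)."""
    return force_field_force + weight * pmf_force(pmf, bin_width, index)

# Hypothetical histogram of one order parameter collected from database structures.
pmf = pmf_from_counts([1, 9, 1])
print(hybrid_force(2.0, pmf, bin_width=1.0, index=0, weight=0.5))
```

With `weight` set to 0 the sketch reduces to the plain force field; larger weights pull the coordinate toward the most populated bin of the histogram.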

Expanding our knowledge of the protein universe: Modelling of protein structures

Acta Crystallographica Section A Foundations and Advances ◽

10.1107/s2053273314095084 ◽

2014 ◽

Vol 70 (a1) ◽

pp. C491-C491

Author(s):

Jürgen Haas ◽

Alessandro Barbato ◽

Tobias Schmidt ◽

Steven Roth ◽

Andrew Waterhouse ◽

...

Keyword(s):

Computational Modeling ◽

Structure Prediction ◽

Structural Information ◽

Protein Structures ◽

Model Organism ◽

Data Bank ◽

Continuous Model ◽

Structure Modeling ◽

Structure Comparison ◽

Modeling And Prediction

Computational modeling and prediction of three-dimensional macromolecular structures and complexes from their sequence has been a long-standing goal in structural biology. Over the last two decades, a paradigm shift has occurred: starting from a large "knowledge gap" between the huge number of protein sequences and the small number of experimentally known structures, today some form of structural information – either experimental or computational – is available for the majority of amino acids encoded by common model organism genomes. Methods for structure modeling and prediction have made substantial progress over the last decades, and template-based homology modeling techniques have matured to a point where they are now routinely used to complement experimental techniques. However, computational modeling and prediction techniques often fall short in accuracy compared to high-resolution experimental structures, and it is often difficult to convey the expected accuracy and structural variability of a specific model. Retrospectively assessing the quality of blind structure predictions against experimental reference structures allows benchmarking the state of the art in structure prediction and identifying areas that need further development. The Critical Assessment of Structure Prediction (CASP) experiment has for the last 20 years assessed progress in the field of protein structure modeling based on predictions for ca. 100 blind prediction targets per experiment, which are carefully evaluated by human experts. The "Continuous Model EvaluatiOn" (CAMEO) project aims to provide a fully automated blind assessment for prediction servers based on the weekly pre-released sequences of the Protein Data Bank (PDB). CAMEO has been made possible by the development of novel scoring methods such as lDDT, which are robust against domain movements and so allow automated continuous structure comparison without human intervention.
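Why a superposition-free local score tolerates domain movements can be seen in a simplified sketch in the spirit of lDDT. It works on one point per residue and counts how many reference distances shorter than an inclusion radius are preserved in the model within several tolerances. The 15 Å radius and the 0.5, 1, 2 and 4 Å tolerances follow the commonly described lDDT defaults; the one-point-per-residue simplification and the function name are assumptions of this sketch, not the CAMEO implementation.

```python
import math

def simplified_lddt(reference, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of local reference distances preserved in the model.

    reference, model: equal-length lists of (x, y, z) tuples, one point per residue.
    """
    if len(reference) != len(model):
        raise ValueError("reference and model must have the same number of points")
    preserved = 0
    total = 0
    n = len(reference)
    for i in range(n):
        for j in range(i + 1, n):
            d_ref = math.dist(reference[i], reference[j])
            if d_ref >= radius:
                continue  # only local contacts are scored
            diff = abs(d_ref - math.dist(model[i], model[j]))
            for tolerance in thresholds:
                total += 1
                if diff < tolerance:
                    preserved += 1
    return preserved / total if total else 0.0

# Two hypothetical two-residue "domains" more than 15 Å apart; the second one
# is displaced in the model, yet all local distances are unchanged.
reference = [(0, 0, 0), (3.8, 0, 0), (30, 0, 0), (33.8, 0, 0)]
model = [(0, 0, 0), (3.8, 0, 0), (30, 10, 0), (33.8, 10, 0)]
print(simplified_lddt(reference, model))
```

Because only distances inside the inclusion radius are compared and no global superposition is computed, a rigid shift of one domain relative to the other leaves the score unchanged.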

Conformational variability in proteins bound to single-stranded DNA: a new benchmark for new docking perspectives

10.22541/au.162040366.69255354/v1 ◽

2021 ◽

Author(s):

Dominique MIAS-LUCQUIN ◽

Isaure Chauvot de Beauchêne

Keyword(s):

Protein Data Bank ◽

Conformational Changes ◽

Molecular Interactions ◽

Protein Structures ◽

Data Bank ◽

Computational Docking ◽

Ssdna Binding ◽

Conformational Variability ◽

High Flexibility ◽

Docking Benchmark

We explored the Protein Data Bank (PDB) to collect protein-ssDNA structures and create a multi-conformational docking benchmark including both bound and unbound protein structures. Owing to the high flexibility of ssDNA when not bound, no unbound ssDNA structure is included. For the 143 groups identified as bound-unbound structures of the same protein, we studied the conformational changes in the protein induced by ssDNA binding. Moreover, where several bound or unbound protein structures were available in a group, we also assessed the intrinsic conformational variability in either bound or unbound conditions, and compared it to the supposedly binding-induced modifications. This benchmark is, to our knowledge, the first attempt to survey the available structures of protein–ssDNA interactions to such an extent, with the aim of improving computational docking tools dedicated to this kind of molecular interaction.

Structure-Guided Computational Approaches to Unravel Druggable Proteomic Landscape of Mycobacterium leprae

Frontiers in Molecular Biosciences ◽

10.3389/fmolb.2021.663301 ◽

2021 ◽

Vol 8 ◽

Author(s):

Sundeep Chaitanya Vedithi ◽

Sony Malhotra ◽

Marta Acebrón-García-de-Eulate ◽

Modestas Matusevicius ◽

Pedro Henrique Monteiro Torres ◽

...

Keyword(s):

Drug Discovery ◽

Schwann Cells ◽

Protein Structures ◽

Mycobacterium Leprae ◽

Data Bank ◽

Nerve Damage ◽

Structural Proteomics ◽

Bacterial Survival ◽

Functional Sites

Leprosy, caused by Mycobacterium leprae (M. leprae), is treated with a multidrug regimen comprising Dapsone, Rifampicin, and Clofazimine. These drugs exhibit bacteriostatic, bactericidal and anti-inflammatory properties, respectively, and control the dissemination of infection in the host. However, the current treatment is not cost-effective, does not favor patient compliance due to its long duration (12 months) and does not protect against the incumbent nerve damage, which is a severe leprosy complication. The chronic infectious peripheral neuropathy associated with the disease is primarily due to the bacterial components infiltrating the Schwann cells that protect neuronal axons, thereby inducing a demyelinating phenotype. There is a need to discover novel/repurposed drugs that can act as short-duration and effective alternatives to the existing treatment regimens, preventing nerve damage and consequent disability associated with the disease. Mycobacterium leprae is an obligate pathogen that cannot be cultivated in vitro, which limits drug discovery efforts to repositioning screens in mouse footpad models. The dearth of knowledge related to structural proteomics of M. leprae, coupled with emerging antimicrobial resistance to all three drugs in the multidrug therapy, calls for concerted novel drug discovery efforts. A comprehensive understanding of the proteomic landscape of M. leprae is indispensable to unravel druggable targets that are essential for bacterial survival and predilection for human neuronal Schwann cells. Of the 1,614 protein-coding genes in the genome of M. leprae, only 17 protein structures are available in the Protein Data Bank. In this review, we discuss efforts made to model the proteome of M. leprae using a suite of software for protein modeling that has been developed in the Blundell laboratory. Precise template selection by employing sequence-structure homology recognition software, multi-template modeling of the monomeric models and accurate quality assessment are the hallmarks of the modeling process. Tools that map interfaces and enable building of homo-oligomers are discussed in the context of interface stability. Other software is described to determine the druggable proteome by using information related to the chokepoint analysis of the metabolic pathways, gene essentiality, homology to human proteins, functional sites, druggable pockets and fragment hotspot maps.

MRPC (Missing Regions in Polypeptide Chains): a knowledgebase

Journal of Applied Crystallography ◽

10.1107/s1600576719012330 ◽

2019 ◽

Vol 52 (6) ◽

pp. 1422-1426

Author(s):

Rajendran Santhosh ◽

Namrata Bankoti ◽

Adgonda Malgonnavar Padmashri ◽

Daliah Michael ◽

Jeyaraman Jeyakanthan ◽

...

Keyword(s):

Protein Structures ◽

Three Dimensional ◽

Protein Molecule ◽

Data Bank ◽

Protein Crystal ◽

Dimensional Structure ◽

Protein Structure Analysis ◽

Three Dimensional Structure ◽

X Ray Crystallography ◽

Polypeptide Chains

Missing regions in protein crystal structures are those regions that cannot be resolved, mainly owing to poor electron density (if the three-dimensional structure was solved using X-ray crystallography). These missing regions are known to have high B factors and could represent loops with a possibility of being part of an active site of the protein molecule. Thus, they are likely to provide valuable information and play a crucial role in the design of inhibitors and drugs and in protein structure analysis. In view of this, an online database, Missing Regions in Polypeptide Chains (MRPC), has been developed which provides information about the missing regions in protein structures available in the Protein Data Bank. In addition, the new database has an option for users to obtain the above data for non-homologous protein structures (25 and 90%). A user-friendly graphical interface with various options has been incorporated, with a provision to view the three-dimensional structure of the protein along with the missing regions using JSmol. The MRPC database is updated regularly (currently once every three months) and can be accessed freely at the URL http://cluster.physics.iisc.ac.in/mrpc.

RepeatsDB in 2021: improved data and extended classification for protein tandem repeat structures

Nucleic Acids Research ◽

10.1093/nar/gkaa1097 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D452-D457

Author(s):

Lisanna Paladin ◽

Martina Bevilacqua ◽

Sara Errigo ◽

Damiano Piovesan ◽

Ivan Mičetić ◽

...

Keyword(s):

Protein Data Bank ◽

Tandem Repeat ◽

Tandem Repeats ◽

Classification Scheme ◽

Sequence Similarity ◽

Protein Structures ◽

Hierarchical Classification ◽

Structural Similarity ◽

Data Bank ◽

Similarity Class

The RepeatsDB database (URL: https://repeatsdb.org/) provides annotations and classification for protein tandem repeat structures from the Protein Data Bank (PDB). Protein tandem repeats are ubiquitous in all branches of the tree of life. The accumulation of solved repeat structures provides new possibilities for classification and detection, but also increases the need for annotation. Here we present RepeatsDB 3.0, which addresses these challenges and presents an extended classification scheme. The major conceptual change compared to the previous version is the hierarchical classification combining top levels based solely on structural similarity (Class > Topology > Fold) with two new levels (Clan > Family) requiring sequence similarity and describing repeat motifs in collaboration with Pfam. Data growth has been addressed with improved mechanisms for browsing the classification hierarchy. A new UniProt-centric view unifies the increasingly frequent annotation of structures from identical or similar sequences. This update of RepeatsDB aligns with our commitment to develop a resource that extracts, organizes and distributes specialized information on tandem repeat protein structures.

Improving the quality of NMR and crystallographic protein structures by means of a conformational database potential derived from structure databases

Protein Science ◽

10.1002/pro.5560050609 ◽

1996 ◽

Vol 5 (6) ◽

pp. 1067-1080 ◽

Cited By ~ 147

Author(s):

John Kuszewski ◽

Angela M. Gronenborn ◽

G. Marius Clore

Keyword(s):

Protein Structures ◽

Structure Databases

Insights on the relationship between total grazing pressure management and sustainable land management: key indicators to verify impacts

The Rangeland Journal ◽

10.1071/rj19078 ◽

2019 ◽

Vol 41 (6) ◽

pp. 535 ◽

Cited By ~ 6

Author(s):

C. M. Waters ◽

S. E. McDonald ◽

J. Reseigh ◽

R. Grant ◽

D. G. Burnside

Keyword(s):

Land Management ◽

Management Practices ◽

Temporal Trends ◽

Grazing Intensity ◽

Grazing Pressure ◽

Environmental Stewardship ◽

Grazing Management ◽

Sustainable Land Management ◽

Key Indicators ◽

The Impact

Demonstrating sustainable land management (SLM) requires an understanding of the linkages between grazing management and environmental stewardship. Grazing management practices that incorporate strategic periods of rest are promoted internationally as best practice. However, spatial and temporal trends in unmanaged feral (goat) and native (kangaroo) populations in the southern Australian rangelands can result in land managers having, at times, control over less than half the grazing pressure, precluding the ability to rest pastures. Few empirical studies have examined the impacts of total grazing pressure (TGP) on biodiversity and resource condition, while the inability to manage grazing intensity at critical times may result in negative impacts on ground cover, changes in pasture species composition, increased rates of soil loss and a reduced ability of soils to store carbon. The widespread adoption of TGP control through exclusion fencing in the southern Australian rangelands has created unprecedented opportunities to manage total grazing pressure, although there is little direct evidence that this infrastructure leads to more sustainable land management. Here we identify several key indicators, either outcome- or activity-based, that could serve as a basis for verifying the impacts of TGP management. Since TGP is the basic determinant of the impact of herbivory on vegetation, it follows that the ability of rangeland pastoral management to demonstrate SLM and environmental stewardship will rely on using evidence-based indicators to support an environmental social licence to operate.

Accurate Representation of Protein-Ligand Structural Diversity in the Protein Data Bank (PDB)

International Journal of Molecular Sciences ◽

10.3390/ijms21062243 ◽

2020 ◽

Vol 21 (6) ◽

pp. 2243

Author(s):

Nicolas K. Shinada ◽

Peter Schmidtke ◽

Alexandre G. de Brevern

Keyword(s):

Protein Data Bank ◽

Protein Sequence ◽

Large Scale ◽

Protein Structures ◽

Structural Diversity ◽

Data Bank ◽

Protein Distribution ◽

Research Areas ◽

Identity Threshold ◽

Protein Sequence Identity

The number of available protein structures in the Protein Data Bank (PDB) has considerably increased in recent years. Thanks to the growth in structures and complexes, numerous large-scale studies have been carried out in various research areas, e.g., protein–protein interactions, protein–DNA interactions, or drug discovery. While protein redundancy has so far been managed using only a simple protein sequence identity threshold, the similarity of protein-ligand complexes should also be considered from a structural perspective. Protein-ligand duplicates in the PDB are widely known, but they have never been quantitatively assessed, as they are quite complex to analyze and compare. Here, we present a specific clustering of protein-ligand structures to avoid the bias found in different studies. The methodology is based on binding site superposition and a combination of weighted Root Mean Square Deviation (RMSD) assessment and hierarchical clustering. Repeated structures of proteins of interest are highlighted and only representative conformations are retained for a non-biased view of protein distribution. Three types of cases are described based on the number of distinct conformations identified for each complex. Defining these categories decreases the number of complexes by 3.84-fold and offers more refined results compared to a protein sequence-based method. Widely distinct conformations were analyzed using normalized B-factors. Furthermore, a non-redundant dataset was generated for future molecular interaction analyses or virtual screening studies.
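The combination of weighted RMSD and hierarchical clustering described above can be sketched in a few lines. The example assumes the binding sites have already been superposed and have matching atom order; the single-linkage rule, the distance cutoff, the weights and all function names are illustrative assumptions, since the abstract does not specify them.

```python
import math

def weighted_rmsd(coords_a, coords_b, weights):
    """Weighted RMSD between two pre-superposed, equally ordered coordinate lists."""
    squared = sum(w * math.dist(a, b) ** 2
                  for a, b, w in zip(coords_a, coords_b, weights))
    return math.sqrt(squared / sum(weights))

def cluster_single_linkage(items, distance, cutoff):
    """Agglomerative clustering: merge the two closest clusters until the
    closest pair is farther apart than the cutoff. Returns lists of item indices."""
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Single linkage: distance between the closest members of two clusters.
        return min(distance(items[i], items[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        d, a, b = min((linkage(clusters[a], clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters)))
        if d > cutoff:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Three hypothetical two-atom binding sites; the first two are near-duplicates.
sites = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(5.0, 0.0, 0.0), (6.0, 0.0, 0.0)],
]
groups = cluster_single_linkage(sites, lambda a, b: weighted_rmsd(a, b, [1.0, 1.0]), cutoff=1.0)
print(groups)
```

Keeping one representative per resulting group is one way to reduce redundancy; the choice of linkage and cutoff changes how aggressively conformations are merged.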
