ValiFrag: Evaluating fragment quality during automated protein model building

2014 ◽  
Vol 70 (a1) ◽  
pp. C1443-C1443
Author(s):  
Joana Pereira ◽  
Tim Wiegels ◽  
Victor Lamzin

X-ray diffraction data from flexible macromolecules and their complexes can rarely be measured to a resolution better than 3 Å. Owing to the loss of detectable atomic features, the determination of low-resolution structures is beyond the current operational range of crystallographic software and requires a large amount of manual intervention. ARP/wARP [1] v7.4 generates structures that are up to 80% complete at 3.0 Å, but the completeness drops sharply as the resolution gets worse. The reduction in model completeness is accompanied by an increase in the number of fragments built, which also become shorter. Such fragments are useful for further model building if they are correct. If they are wrong, however, they may cause incorrectly built regions to form in the final model. Thus, there is a need to improve fragment quality before automated model completion is applied. We exploit the vast amount of structural information deposited in the Protein Data Bank (PDB) [2] for the structural validation of built fragments. Specifically, we evaluate the conformation of each fragment. If the conformation is present in several different protein models in the PDB, it is likely to be modelled correctly in the built model and is accepted. If, on the contrary, it cannot be found in any PDB model, it is probably incorrect. Here we present the software implementation of this validation, called ValiFrag, which checks the validity of automatically built protein chain fragments by evaluating their occurrence in the PDB. Protein models from the PDB were broken into dipeptides, and conformational parameters for each of these were stored in a database. For each automatically built fragment, ValiFrag computes the probability that it is correct according to the conformations of all possible dipeptides. It can therefore assess which fragments are likely to be structurally incorrect and should be modified, or even removed, to improve the final model.
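
The database lookup described in this abstract can be illustrated with a minimal sketch. It is not the ValiFrag implementation: the abstract does not say which conformational parameters are stored or how the probability is computed. The sketch assumes, purely for illustration, that each dipeptide is described by two torsion angles in degrees, that conformations are binned on a 30° grid, and that a fragment is scored by the geometric mean of smoothed bin frequencies. The names `bin_conformation`, `build_database` and `fragment_score` are hypothetical.

```python
import math
from collections import Counter

BIN_DEG = 30  # hypothetical bin width in degrees


def bin_conformation(angles):
    """Map a tuple of torsion angles (degrees) to a coarse, periodic bin key."""
    return tuple(int(math.floor((a + 180.0) % 360.0 / BIN_DEG)) for a in angles)


def build_database(reference_dipeptides):
    """Count how often each binned conformation occurs in the reference set."""
    return Counter(bin_conformation(d) for d in reference_dipeptides)


def fragment_score(fragment_dipeptides, database, pseudocount=1.0):
    """Geometric mean of smoothed relative frequencies over a fragment.

    Assumption: two angles per dipeptide, so the grid has (360 / BIN_DEG)^2 bins.
    The pseudocount keeps unseen conformations from giving log(0).
    """
    total = sum(database.values())
    n_bins = (360 // BIN_DEG) ** 2
    log_sum = 0.0
    for d in fragment_dipeptides:
        count = database.get(bin_conformation(d), 0)
        log_sum += math.log((count + pseudocount) / (total + pseudocount * n_bins))
    return math.exp(log_sum / len(fragment_dipeptides))


# Toy reference set: made-up angles, not taken from the PDB.
reference = [(-60.0, -45.0)] * 50 + [(-120.0, 130.0)] * 30
db = build_database(reference)

common = [(-55.0, -40.0), (-50.0, -35.0)]   # falls in a populated bin
unusual = [(60.0, -150.0), (75.0, -160.0)]  # falls in an empty bin
print(fragment_score(common, db) > fragment_score(unusual, db))
```

A fragment whose dipeptides fall in well-populated bins scores higher than one whose dipeptides are never observed, which mirrors the accept/reject reasoning in the abstract.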

2014 ◽  
Vol 70 (a1) ◽  
pp. C327-C327
Author(s):  
Nathaniel Echols ◽  
Nader Morshed ◽  
Nigel Moriarty ◽  
Pavel Afonine ◽  
Thomas Terwilliger ◽  
...  

Although macromolecular crystallography has been greatly accelerated by the development of automated software for data processing, phasing, and model building, most structures require significant manual intervention to yield a truly final model. In addition to missing individual protein or nucleic residues, this may include the addition of alternate conformations, ligands (both free and covalently bound), elemental ions, or modified amino acids. We have developed a number of tools to streamline several of these steps within the Phenix software suite (Adams et al. 2010): 1) an automated pipeline for the determination of ligand-bound structures by molecular replacement (Echols et al. 2014a); 2) placement of elemental ions during refinement (Echols et al. 2014b), as an extension of solvent placement; 3) fitting of additional conformations of protein residues into difference density. These tools reliably reproduce published structures in a majority of test cases, and in several instances identify details omitted by the original authors. Their low false positive rate makes them suitable for use in high-throughput workflows.


2020 ◽  
Vol 76 (3) ◽  
pp. 248-260 ◽  
Author(s):  
Grzegorz Chojnowski ◽  
Koushik Choudhury ◽  
Philipp Heuser ◽  
Egor Sobolev ◽  
Joana Pereira ◽  
...  

The performance of automated protein model building usually decreases with resolution, mainly owing to the lower information content of the experimental data. This calls for a more elaborate use of the available structural information about macromolecules. Here, a new method is presented that uses structural homologues to improve the quality of protein models automatically constructed using ARP/wARP. The method uses local structural similarity between deposited models and the model being built, and results in longer main-chain fragments that in turn can be more reliably docked to the protein sequence. The application of the homology-based model extension method to the example of a CFA synthase at 2.7 Å resolution resulted in a more complete model with almost all of the residues correctly built and docked to the sequence. The method was also evaluated on 1493 molecular-replacement solutions at a resolution of 4.0 Å and better that were submitted to the ARP/wARP web service for model building. A significant improvement in the completeness and sequence coverage of the built models has been observed.


2014 ◽  
Vol 70 (a1) ◽  
pp. C491-C491
Author(s):  
Jürgen Haas ◽  
Alessandro Barbato ◽  
Tobias Schmidt ◽  
Steven Roth ◽  
Andrew Waterhouse ◽  
...  

Computational modeling and prediction of three-dimensional macromolecular structures and complexes from their sequence has been a long-standing goal in structural biology. Over the last two decades, a paradigm shift has occurred: starting from a large "knowledge gap" between the huge number of protein sequences and the small number of experimentally known structures, today, some form of structural information – either experimental or computational – is available for the majority of amino acids encoded by common model organism genomes. Methods for structure modeling and prediction have made substantial progress over the last decades, and template-based homology modeling techniques have matured to a point where they are now routinely used to complement experimental techniques. However, computational modeling and prediction techniques often fall short in accuracy compared to high-resolution experimental structures, and it is often difficult to convey the expected accuracy and structural variability of a specific model. Retrospectively assessing the quality of blind structure prediction in comparison to experimental reference structures allows benchmarking the state of the art in structure prediction and identifying areas which need further development. The Critical Assessment of Structure Prediction (CASP) experiment has for the last 20 years assessed the progress in the field of protein structure modeling based on predictions for ca. 100 blind prediction targets per experiment, which are carefully evaluated by human experts. The "Continuous Model EvaluatiOn" (CAMEO) project aims to provide a fully automated blind assessment for prediction servers based on weekly pre-released sequences of the Protein Data Bank (PDB). CAMEO has been made possible by the development of novel scoring methods such as lDDT, which are robust against domain movements to allow for automated continuous structure comparison without human intervention.
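
To show why a score of the lDDT kind needs no superposition and is therefore insensitive to rigid movements, here is a simplified sketch. It is a reduced, Cα-only, whole-chain variant and not the reference lDDT implementation used by CAMEO. It keeps the usual 15 Å inclusion radius and the 0.5, 1, 2 and 4 Å tolerance thresholds, and it ignores per-residue scores, all-atom handling and stereochemistry checks. The function name `lddt_ca` and the toy coordinates are assumptions for illustration.

```python
import math
from itertools import combinations


def lddt_ca(reference, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Fraction of local reference distances preserved in the model.

    reference, model: equal-length lists of (x, y, z) coordinates, one per residue.
    Only pairs closer than `radius` in the reference are considered.
    """
    pairs = [
        (i, j)
        for i, j in combinations(range(len(reference)), 2)
        if math.dist(reference[i], reference[j]) < radius
    ]
    if not pairs:
        return 0.0
    preserved = 0
    for i, j in pairs:
        d_ref = math.dist(reference[i], reference[j])
        d_mod = math.dist(model[i], model[j])
        # Count how many tolerance thresholds this distance difference satisfies.
        preserved += sum(abs(d_ref - d_mod) < t for t in thresholds)
    return preserved / (len(pairs) * len(thresholds))


# Toy three-residue chain; coordinates are made up.
ref = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 10.0, y + 5.0, z - 2.0) for x, y, z in ref]  # rigid translation
bent = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]   # one distance changed

print(lddt_ca(ref, shifted))
print(lddt_ca(ref, bent))
```

Because only internal distances are compared, the translated copy scores the same as the reference itself, while the bent chain is penalised only for the one distance that changed.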


2021 ◽  
Author(s):  
Joseph H. Lubin ◽  
Christopher Markosian ◽  
D. Balamurugan ◽  
Renata Pasqualini ◽  
Wadih Arap ◽  
...  

There is enormous ongoing interest in characterizing the binding properties of the SARS-CoV-2 Omicron Variant of Concern (VOC) (B.1.1.529), which continues to spread towards potential dominance worldwide. To aid these studies, based on the wealth of available structural information about several SARS-CoV-2 variants in the Protein Data Bank (PDB) and a modeling pipeline we have previously developed for tracking the ongoing global evolution of SARS-CoV-2 proteins, we provide a set of computed structural models (henceforth models) of the Omicron VOC receptor-binding domain (omRBD) bound to its corresponding receptor Angiotensin-Converting Enzyme (ACE2) and a variety of therapeutic entities, including neutralizing and therapeutic antibodies targeting previously-detected viral strains. We generated bound omRBD models using both experimentally-determined structures in the PDB as well as machine learning-based structure predictions as starting points. Examination of ACE2-bound omRBD models reveals an interdigitated multi-residue interaction network formed by omRBD-specific substituted residues (R493, S496, Y501, R498) and ACE2 residues at the interface, which was not present in the original Wuhan-Hu-1 RBD-ACE2 complex. Emergence of this interaction network suggests optimization of a key region of the binding interface, and positive cooperativity among various sites of residue substitutions in omRBD mediating ACE2 binding. Examination of neutralizing antibody complexes for Barnes Class 1 and Class 2 antibodies modeled with omRBD highlights an overall loss of interfacial interactions (with gain of new interactions in rare cases) mediated by substituted residues. Many of these substitutions have previously been found to independently dampen or even ablate antibody binding, and perhaps mediate antibody-mediated neutralization escape (e.g., K417N). 
We observe little compensation of corresponding interaction loss at interfaces when potential escape substitutions occur in combination. A few selected antibodies (e.g., Barnes Class 3 S309), however, feature largely unaltered or modestly affected protein-protein interfaces. While we stress that only qualitative insights can be obtained directly from our models at this time, we anticipate that they can provide starting points for more detailed and quantitative computational characterization, and, if needed, redesign of monoclonal antibodies for targeting the Omicron VOC Spike protein. In the broader context, the computational pipeline we developed provides a framework for rapidly and efficiently generating retrospective and prospective models for other novel variants of SARS-CoV-2 bound to entities of virological and therapeutic interest, in the setting of a global pandemic.


Author(s):  
Miroslaw Gilski ◽  
Jianbo Zhao ◽  
Marcin Kowiel ◽  
Dariusz Brzezinski ◽  
Douglas H. Turner ◽  
...  

Geometrical restraints provide key structural information for the determination of biomolecular structures at lower resolution by experimental methods such as crystallography or cryo-electron microscopy. In this work, restraint targets for nucleic acid bases are derived from three different sources and compared: small-molecule crystal structures in the Cambridge Structural Database (CSD), ultrahigh-resolution structures in the Protein Data Bank (PDB) and quantum-mechanical (QM) calculations. The best parameters are those based on CSD structures. After over two decades, the standard library of Parkinson et al. [(1996), Acta Cryst. D52, 57–64] is still valid, but improvements are possible with the use of the current CSD database. The CSD-derived geometry is fully compatible with Watson–Crick base pairs, as comparisons with QM results for isolated and paired bases clearly show that the CSD targets closely correspond to proper base pairing. While the QM results are capable of distinguishing between single and paired bases, their level of accuracy is, on average, nearly two times lower than for the CSD-derived targets when gauged by root-mean-square deviations from ultrahigh-resolution structures in the PDB. Nevertheless, the accuracy of QM results appears sufficient to provide stereochemical targets for synthetic base pairs where no reliable experimental structural information is available. To enable future tests for this approach, QM calculations are provided for isocytosine, isoguanine and the iCiG base pair.


2021 ◽  
Author(s):  
Haibin Di ◽  
Chakib Kada Kloucha ◽  
Cen Li ◽  
Aria Abubakar ◽  
Zhun Li ◽  
...  

Delineating seismic stratigraphic features and depositional facies is important for successful reservoir mapping and identification in the subsurface. Robust seismic stratigraphy interpretation is confronted with two major challenges. The first is to automate the process as far as possible, particularly given the increasing size of seismic data and the complexity of target stratigraphies, while the second is to efficiently incorporate available structures into stratigraphy model building. Machine learning, particularly the convolutional neural network (CNN), has been introduced to assist seismic stratigraphy interpretation through supervised learning. However, the small number of available expert labels greatly restricts the performance of such supervised CNNs. Moreover, most of the existing CNN implementations are based only on amplitude, which fails to use necessary structural information such as faults for constraining the machine learning. To resolve both challenges, this paper presents a semi-supervised learning workflow for fault-guided seismic stratigraphy interpretation, which consists of two components. The first component is seismic feature engineering (SFE), which aims at learning the provided seismic and fault data through an unsupervised convolutional autoencoder (CAE), while the second one is stratigraphy model building (SMB), which aims at building an optimal mapping function between the features extracted from the SFE CAE and the target stratigraphic labels provided by an experienced interpreter through a supervised CNN. Both components are connected by embedding the encoder of the SFE CAE into the SMB CNN, which forces the SMB learning to be based on features that exist throughout the entire study area instead of only in the limited training data; correspondingly, the risk of overfitting is greatly reduced. 
More innovatively, the fault constraint is introduced by customizing the SMB CNN with two output branches, one to match the target stratigraphies and the other to reconstruct the input fault, so that the fault continues contributing to the process of SMB learning. The performance of such fault-guided seismic stratigraphy interpretation is validated by an application to a real seismic dataset, and the machine prediction not only matches the manual interpretation accurately but also clearly illustrates the depositional process in the study area.


2021 ◽  
Vol 40 (5) ◽  
pp. 324-334
Author(s):  
Rongxin Huang ◽  
Zhigang Zhang ◽  
Zedong Wu ◽  
Zhiyuan Wei ◽  
Jiawei Mei ◽  
...  

Seismic imaging using full-wavefield data that includes primary reflections, transmitted waves, and their multiples has been the holy grail for generations of geophysicists. To be able to use the full-wavefield data effectively requires a forward-modeling process to generate full-wavefield data, an inversion scheme to minimize the difference between modeled and recorded data, and, more importantly, an accurate velocity model to correctly propagate and collapse energy of different wave modes. All of these elements have been embedded in the framework of full-waveform inversion (FWI) since it was proposed three decades ago. However, for a long time, the application of FWI did not find its way into the domain of full-wavefield imaging, mostly owing to the lack of data sets with good constraints to ensure the convergence of inversion, the required compute power to handle large data sets and extend the inversion frequency to the bandwidth needed for imaging, and, most significantly, stable FWI algorithms that could work with different data types in different geologic settings. Recently, with the advancement of high-performance computing and progress in FWI algorithms at tackling issues such as cycle skipping and amplitude mismatch, FWI has found success using different data types in a variety of geologic settings, providing some of the most accurate velocity models for generating significantly improved migration images. Here, we take a step further to modify the FWI workflow to output the subsurface image or reflectivity directly, potentially eliminating the need to go through the time-consuming conventional seismic imaging process that involves preprocessing, velocity model building, and migration. 
Compared with a conventional migration image, the reflectivity image directly output from FWI often provides additional structural information with better illumination and higher signal-to-noise ratio naturally as a result of many iterations of least-squares fitting of the full-wavefield data.


Author(s):  
David Blow

The result of all the work described in the previous chapters will be a set of coordinates and other data suitable for deposit in the Protein Data Bank. You or I may use these coordinates, and we need to have some insight into their accuracy and reliability. In the previous chapters, indicators have been described, which may suggest aspects of the data or interpretation procedures that might lead to problems. But as the determination of protein crystal structures becomes more routine, many of these indicators are omitted from publications. Fortunately, crystallographic procedures are self-checking to a large extent. It is rare for a major error of interpretation to lead right through to a published refined structure. A high Rfree factor is a warning, especially if coupled with departures from the requirements of correct bond lengths, angles, and acceptable dihedral angles. On the other hand, there will always be a desire to squeeze more results from the data. All interpretations are subject to error; nearly all protein crystals have regions that are less ordered, where accurate interpretation is less feasible; and the structure may be overrefined, using too many variables for the data. If the majority of the molecule is correctly interpreted, a reasonable R factor may be obtained even though some small regions are completely wrong. During refinement it is usual to restrain the bond lengths and bond angles to be near their theoretical values, as described in Chapter 12. The extent to which bond lengths and bond angles depart from these values is often quoted as an indicator of accuracy. These departures are, however, difficult to interpret because they depend on how tightly the restraints have been applied. The same applies to the restraint of certain coordinates to lie in a plane. This difficulty illustrates a general problem. Designers of refinement procedures are understandably anxious to improve their procedures to lead directly to a well-refined structure. 
Every aspect of structure that can be recognized as having a regularity could, in principle, be expressed as a restraint which enforces it during refinement.


2019 ◽  
Vol 63 (5) ◽  
Author(s):  
Vivek Keshri ◽  
Seydina M. Diene ◽  
Adrien Estienne ◽  
Justine Dardaillon ◽  
Olivier Chabrol ◽  
...  

β-Lactamase enzymes have attracted substantial medical attention from researchers and clinicians because of their clinical, ecological, and evolutionary interest. Here, we present a comprehensive online database of β-lactamase enzymes. The current database is manually curated and incorporates the primary amino acid sequences, closest structural information in an external structure database (the Protein Data Bank [PDB]) and the functional profiles and phylogenetic trees of the four molecular classes (A, B, C, and D) of β-lactamases. The functional profiles are presented according to the MICs and kinetic parameters that make them more useful for the investigators. Here, a total of 1,147 β-lactam resistance genes are analyzed and described in the database. The database is implemented in MySQL and the related website is developed with Zend Framework 2 on an Apache server, supporting all major web browsers. Users can easily retrieve and visualize biologically important information using a set of efficient queries from a graphical interface. This database is freely accessible at http://ifr48.timone.univ-mrs.fr/beta-lactamase/public/.


2007 ◽  
Vol 02 (03n04) ◽  
pp. 267-271
Author(s):  
ZOLTÁN SZABADKA ◽  
RAFAEL ÖRDÖG ◽  
VINCE GROLMUSZ

The Protein Data Bank (PDB) is the most important depository of protein structural information, containing more than 45,000 deposited entries today. Because of its inhomogeneous structure, its fully automated processing is almost impossible. In a previous work, we cleaned and re-structured the entries in the Protein Data Bank, and from the result we built the RS-PDB database. Using the RS-PDB database, we draw a Ramachandran plot from 6,593 "perfect" polypeptide chains found in the PDB, containing 1,192,689 residues. This is a more than tenfold increase in the size of the data analyzed before this work. The density of the data points makes it possible to draw a Ramachandran map enhanced with a logarithmic heat map, showing the fine inner structure of the right-handed α-helix region.
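
The idea of a logarithmic heat map over (φ, ψ) space can be sketched as a binned count followed by a log transform, so that sparsely populated regions remain visible next to the densely populated helix region. This is an illustrative sketch only: the 10° bin width, the use of log(1 + count) and the function name `log_density_grid` are assumptions, not details given in the abstract.

```python
import math


def log_density_grid(phi_psi, bin_deg=10):
    """Bin (phi, psi) pairs in degrees on a periodic grid and return log(1 + count).

    Rows index psi and columns index phi, both starting at -180 degrees.
    """
    n = 360 // bin_deg
    counts = [[0] * n for _ in range(n)]
    for phi, psi in phi_psi:
        col = int((phi + 180.0) % 360.0 // bin_deg)
        row = int((psi + 180.0) % 360.0 // bin_deg)
        counts[row][col] += 1
    # log1p compresses the dynamic range; empty cells stay at 0.
    return [[math.log1p(c) for c in row] for row in counts]


# Made-up angles: a dense cluster near the right-handed helix and one outlier.
angles = [(-60.0, -45.0)] * 100 + [(60.0, 45.0)]
grid = log_density_grid(angles)

print(round(grid[13][12], 3))  # dense cell containing (-60, -45)
print(round(grid[22][24], 3))  # cell containing the single outlier
```

On a linear colour scale the outlier cell would be about one hundredth of the dense cell, whereas after the log transform the ratio is far smaller, which is what makes fine structure in both regions visible in a single map.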

