Predicting the performance of automated crystallographic model-building pipelines

Author(s): Emad Alharbi, Paul Bond, Radu Calinescu, Kevin Cowtan

Proteins are macromolecules that perform essential biological functions which depend on their three-dimensional structure. Determining this structure involves complex laboratory and computational work. For the computational work, multiple software pipelines have been developed to build models of the protein structure from crystallographic data. Each of these pipelines performs differently depending on the characteristics of the electron-density map received as input. Identifying the best pipeline to use for a protein structure is difficult, as the pipeline performance differs significantly from one protein structure to another. As such, researchers often select pipelines that do not produce the best possible protein models from the available data. Here, a software tool is introduced which predicts key quality measures of the protein structures that a range of pipelines would generate if supplied with a given crystallographic data set. These measures are crystallographic quality-of-fit indicators based on included and withheld observations, and structure completeness. Extensive experiments carried out using over 2500 data sets show that the tool yields accurate predictions for both experimental phasing data sets (at resolutions between 1.2 and 4.0 Å) and molecular-replacement data sets (at resolutions between 1.0 and 3.5 Å). The tool can therefore provide a recommendation to the user concerning the pipelines that should be run in order to proceed most efficiently to a depositable model.
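The tool described above learns to predict quality measures per pipeline and recommends the pipeline expected to perform best. As a loose illustration of that idea only, here is a toy sketch that fits a single-feature linear model (completeness versus resolution) per pipeline and recommends the best predicted one. The pipeline names, training values and the single-feature model are invented for illustration; the actual tool is a machine-learned predictor over many data-set characteristics.

```python
# Hypothetical sketch: recommend a model-building pipeline by predicting
# structure completeness from data-set resolution. All training values and
# pipeline names below are illustrative inventions, not the paper's data.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (one feature)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Invented training data: (resolution in Angstrom, completeness fraction).
training = {
    "pipeline_A": ([1.5, 2.0, 2.5, 3.0], [0.95, 0.90, 0.75, 0.55]),
    "pipeline_B": ([1.5, 2.0, 2.5, 3.0], [0.92, 0.88, 0.80, 0.65]),
}

def recommend(resolution):
    """Predict completeness for each pipeline; return the best and all predictions."""
    preds = {}
    for name, (xs, ys) in training.items():
        a, b = fit_line(xs, ys)
        preds[name] = a * resolution + b
    return max(preds, key=preds.get), preds
```

With these toy numbers, `recommend(2.8)` favours the pipeline whose completeness degrades more slowly at low resolution, which mirrors the recommendation the real tool provides from its learned predictions.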

2021, Vol. 7
Author(s): Castrense Savojardo, Matteo Manfredi, Pier Luigi Martelli, Rita Casadio

Solvent-accessible surface area (SASA) is a key feature of proteins for determining their folding and stability. SASA is computed from protein structures with different algorithms, and from protein sequences with machine-learning approaches trained on solved structures. Here we ask to what extent the solvent exposure of residues can be associated with the pathogenicity of a variation. In this way, the SASA of the wild-type residue acquires a role in the functional annotation of protein single-residue variations (SRVs). By mapping variations onto a curated database of human protein structures, we found that residues targeted by disease-related SRVs are less accessible to solvent than residues involved in polymorphisms. The disease association is not evenly distributed among the different residue types: SRVs targeting glycine, tryptophan, tyrosine and cysteine are more frequently disease-associated than others. For all residues, the proportion of disease-related SRVs increases markedly when the wild-type residue is buried and decreases when it is exposed, with the extent of the increase depending on the residue type. With the aid of an in-house predictor, based on a deep-learning procedure and performing at the state of the art, we confirm this tendency by analyzing a large data set of residues subjected to variation and occurring in some 12,494 human protein sequences that still lack a three-dimensional structure (derived from HUMSAVAR). Our data support the notion that solvent-accessible area is a distinguishing property of residues that undergo variation, and that pathogenicity is more frequently associated with buried residues than with exposed ones.
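The buried/exposed distinction above is usually made by normalizing a residue's SASA against a reference maximum for its type. A minimal sketch follows: the maximum-ASA values are from the Tien et al. (2013) theoretical scale, and the 25% relative-SASA cut-off for "buried" is one common convention, not necessarily the exact criterion used in this study.

```python
# Illustrative sketch: classify residues as buried or exposed from their
# solvent-accessible surface area (SASA). The cut-off (25% relative SASA)
# is a common convention, assumed here rather than taken from the paper.

MAX_ASA = {  # theoretical maximum ASA per residue type, in square Angstrom
    "ALA": 129.0, "CYS": 167.0, "GLY": 104.0, "TRP": 285.0, "TYR": 263.0,
}

def relative_sasa(res_type, sasa):
    """Observed SASA as a fraction of the residue type's maximum ASA."""
    return sasa / MAX_ASA[res_type]

def is_buried(res_type, sasa, cutoff=0.25):
    """True if the residue's relative SASA falls below the cut-off."""
    return relative_sasa(res_type, sasa) < cutoff
```

For example, a glycine with 20 Å² of exposed surface is classified as buried, while a tryptophan with 120 Å² is exposed, since the normalization accounts for the very different sizes of the two side chains.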


Author(s): Arun G. Ingale

Predicting the structure of a protein from its primary amino acid sequence is computationally difficult. An investigation of the methods and algorithms used to predict protein structure, together with a thorough knowledge of protein function and structure, is critical for the advancement of biology and the life sciences, as well as for the development of better drugs, higher-yield crops and even synthetic biofuels. To that end, this chapter sheds light on the methods used for protein structure prediction. It covers the applications of modeled protein structures and explores the relationship between pure sequence information and three-dimensional structure, which remains one of the greatest challenges in molecular biology. It presents a broad examination of the problems, methods, tools, servers, databases and applications of protein structure prediction, giving insight into the future applications of modeled protein structures. Current protein structure prediction methods are reviewed to provide background on structure prediction, the prediction of structural fundamentals, tertiary structure prediction and functional inference. The basic ideas and advances in these directions are discussed in detail.


Author(s): James B. Elsner, Thomas H. Jagger

Hurricane data originate from careful analysis of past storms by operational meteorologists. The data include estimates of hurricane position and intensity at 6-hourly intervals. Information on landfall time, local wind speeds, damages and deaths, as well as cyclone size, is included. The data are archived by season. Some effort is needed to make the data useful for hurricane climate studies. In this chapter, we describe the data sets used throughout this book. We show you a workflow that includes importing, interpolating, smoothing, and adding attributes. We also show you how to create subsets of the data. Code in this chapter is more complicated and can take longer to run. You can skip this material on first reading and continue with model building in Chapter 7. You can return here when you have an updated version of the data that includes the most recent years. Most statistical models in this book use the best-track data. Here we describe these data and provide original source material. We also explain how to smooth and interpolate them. Interpolations are needed for regional hurricane analyses. The best-track data set contains the 6-hourly center locations and intensities of all known tropical cyclones across the North Atlantic basin, including the Gulf of Mexico and Caribbean Sea. The data set is called HURDAT, for HURricane DATa. It is maintained by the U.S. National Oceanic and Atmospheric Administration (NOAA) at the National Hurricane Center (NHC). Center locations are given in geographic coordinates (in tenths of degrees); the intensities, representing one-minute near-surface (∼10 m) wind speeds, are given in knots (1 kt = 0.5144 m s−1); and the minimum central pressures are given in millibars (1 mb = 1 hPa). The data are provided at 6-hourly intervals starting at 00 UTC (Coordinated Universal Time). The version of the HURDAT file used here contains cyclones over the period 1851 through 2010 inclusive.
Information on the history and origin of these data can be found in Jarvinen et al. (1984). The file has a logical structure that makes it easy to read with a FORTRAN program. Each cyclone contains a header record, a series of data records, and a trailer record.
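Preparing these records for analysis involves the unit conversions the text describes: positions stored in tenths of degrees, winds in knots (1 kt = 0.5144 m/s), and pressures in millibars. The sketch below applies those conversions to one observation; the function signature is a simplified, hypothetical stand-in, not the actual HURDAT fixed-format record layout.

```python
# Sketch of the unit conversions used when preparing best-track records.
# Only the conversions follow the text; the record layout here is a
# simplified stand-in for the real HURDAT file format.

KT_TO_MS = 0.5144  # 1 knot in metres per second

def parse_record(lat_tenths, lon_tenths, wind_kt, pressure_mb):
    """Convert one 6-hourly observation into analysis-friendly values."""
    return {
        "lat": lat_tenths / 10.0,        # degrees north
        "lon": lon_tenths / 10.0,        # degrees west
        "wind_ms": wind_kt * KT_TO_MS,   # one-minute near-surface wind, m/s
        "pressure_hpa": pressure_mb,     # 1 mb == 1 hPa, no conversion needed
    }
```

For instance, a record stored as latitude 280, longitude 805 and 100 kt corresponds to 28.0°N, 80.5°W with a 51.44 m/s wind.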


2017, Vol. 3 (2), pp. 195–198
Author(s): Philip Westphal, Sebastian Hilbert, Michael Unger, Claire Chalopin

Planning of interventions to treat cardiac arrhythmia requires a 3D patient-specific model of the heart. Currently available commercial and free software dedicated to this task has important limitations for routine use: automatic algorithms are not robust enough, while manual methods are time-consuming. The project therefore attempts to develop an optimal software tool. The heart model is generated from preoperative MR data sets acquired with contrast agent and allows visualisation of damaged cardiac tissue. A requirement in the development of the software tool was the use of semi-automatic functions for greater robustness. Once the patient image data set has been loaded, the user selects a region of interest. Thresholding functions allow selection of the high-intensity areas, which correspond to anatomical structures filled with contrast agent, namely cardiac cavities and blood vessels. Thereafter the target structure, for example the left ventricle, is coarsely selected by interactively outlining the gross shape. An active-contour function automatically adjusts the initial contour to the image content. The result can still be improved manually using fast interaction tools. Finally, possible scar tissue located in the cavity muscle is automatically detected and visualized on the 3D heart model. The model is exported in a format compatible with the interventional devices at the hospital. The evaluation of the software tool included two steps. Firstly, a comparison with two free software tools was performed on two image data sets of variable quality. Secondly, six scientists and physicians tested our tool and filled out a questionnaire. The performance of our software tool was visually judged more satisfactory than that of the free software, especially on the data set of lower quality. Professionals evaluated our functionalities positively regarding time taken, ease of use and quality of results.
Improvements would consist of performing the planning based on different MR modalities.
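The thresholding step described above reduces to keeping voxels whose intensity exceeds a user-chosen value, yielding a binary mask of the contrast-filled structures. A minimal sketch on one 2D slice follows; the intensities and threshold are illustrative, and the actual tool combines this with region-of-interest cropping and active contours.

```python
# Minimal sketch of intensity thresholding: bright voxels (contrast-enhanced
# cavities and vessels) become 1 in the binary mask, everything else 0.
# Values and threshold are illustrative, not taken from the paper.

def threshold_mask(image, threshold):
    """Return a binary mask (1 = bright structure) for a 2D slice."""
    return [[1 if v > threshold else 0 for v in row] for row in image]

slice_ = [
    [10, 200, 220],
    [15,  30, 210],
]
mask = threshold_mask(slice_, 100)
```

In the real workflow this mask only provides the coarse starting region; the active-contour step then refines its boundary against the image content.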


2014, Vol. 70 (7), pp. 1994–2006
Author(s): Rocco Caliandro, Benedetta Carrozzini, Giovanni Luca Cascarano, Giuliana Comunale, Carmelo Giacovazzo, ...

Phasing proteins at non-atomic resolution is still a challenge for any ab initio method. A variety of algorithms [Patterson deconvolution, superposition techniques, a cross-correlation function (Cmap), the VLD (vive la difference) approach, the FF function, a nonlinear iterative peak-clipping algorithm (SNIP) for defining the background of a map and the free lunch extrapolation method] have been combined to overcome the lack of experimental information at non-atomic resolution. The method has been applied to a large number of protein diffraction data sets with resolutions varying from atomic to 2.1 Å, with the condition that S or heavier atoms are present in the protein structure. The applications include the use of ARP/wARP to check the quality of the final electron-density maps in an objective way. The results show that resolution is still the main obstacle to protein phasing, but also suggest that the solution of protein structures at 2.1 Å resolution is a feasible, even if still exceptional, task for the combined set of algorithms implemented in the phasing program. The approach described here is more efficient than previously described procedures: e.g. the combined use of the algorithms mentioned above is frequently able to provide phases of sufficiently high quality to allow automatic model building. The method is implemented in the current version of SIR2014.
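Of the algorithms listed, SNIP is the most self-contained: it estimates a background by repeatedly clipping any point that rises above the average of its neighbours at increasing window widths. A plain 1D version is sketched below for illustration; applying it to an electron-density map, as in the paper, is a multi-dimensional generalization of the same clipping rule.

```python
# Sketch of the SNIP (nonlinear iterative peak-clipping) idea: iteratively
# clip peaks so that only the slowly varying background remains. This 1D
# version is illustrative; the paper applies the idea to density maps.

def snip_background(y, max_window):
    """Estimate the background of a 1D signal by iterative peak clipping."""
    bg = list(y)
    for m in range(1, max_window + 1):
        clipped = list(bg)
        for i in range(m, len(bg) - m):
            # Replace the point by the neighbour average if that is lower.
            clipped[i] = min(bg[i], 0.5 * (bg[i - m] + bg[i + m]))
        bg = clipped
    return bg
```

By construction the estimate never exceeds the original signal, so subtracting it leaves the peaks sitting on a flat baseline.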


2019, Vol. 7 (3), pp. SE113–SE122
Author(s): Yunzhi Shi, Xinming Wu, Sergey Fomel

Salt-boundary interpretation is important for the understanding of salt tectonics and for velocity model building for seismic migration. Conventional methods consist of computing salt attributes and extracting salt boundaries. We have formulated the problem as 3D image segmentation and evaluated an efficient approach based on deep convolutional neural networks (CNNs) with an encoder-decoder architecture. To train the model, we design a data generator that extracts randomly positioned subvolumes from a large-scale 3D training data set, followed by data augmentation, and then feed a large number of subvolumes into the network, using salt/nonsalt binary labels generated by thresholding the velocity model as ground truth. We test the model on validation data sets and compare the blind-test predictions with the ground truth. Our results indicate that our method is capable of automatically capturing subtle salt features from the 3D seismic image with little or no need for manual input. We further test the model on a field example to demonstrate the generalization of this deep CNN method across different data sets.
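The data generator described above has two parts: cut a randomly positioned subvolume out of the seismic volume, and derive its binary labels by thresholding the co-registered velocity model. A dependency-free sketch of both follows; the cubic patch shape and the salt velocity threshold (4.5 km/s) are illustrative assumptions, not values from the paper.

```python
# Sketch of a training-patch generator: random subvolume extraction plus
# salt/nonsalt labels from a thresholded velocity model. Patch shape and
# the 4.5 km/s salt threshold are assumptions for illustration.

import random

def extract_patch(volume, velocity, size, v_salt=4.5, rng=random):
    """Return (subvolume, binary_labels) for one random cubic patch."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    z0 = rng.randrange(nz - size + 1)
    y0 = rng.randrange(ny - size + 1)
    x0 = rng.randrange(nx - size + 1)
    patch = [[[volume[z][y][x] for x in range(x0, x0 + size)]
              for y in range(y0, y0 + size)]
             for z in range(z0, z0 + size)]
    labels = [[[1 if velocity[z][y][x] > v_salt else 0
                for x in range(x0, x0 + size)]
               for y in range(y0, y0 + size)]
              for z in range(z0, z0 + size)]
    return patch, labels
```

In the full workflow each patch would additionally pass through data augmentation before being fed, with its labels, to the encoder-decoder CNN.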


2014, Vol. 2014, pp. 1–7
Author(s): Shambhu Malleshappa Gowder, Jhinuk Chatterjee, Tanusree Chaudhuri, Kusum Paul

The analysis of protein structures provides plenty of information about the factors governing the folding and stability of proteins, the preferred amino acids in the protein environment, the location of residues in the interior or on the surface of a protein, and so forth. In general, hydrophobic residues such as Val, Leu, Ile, Phe and Met tend to be buried in the interior, with polar side chains exposed to solvent. The present work uses both sequence and structural information and aims to understand the nature of hydrophobic residues on protein surfaces. It is based on a nonredundant data set of 218 monomeric proteins. The solvent accessibility of each protein was determined using the NACCESS software; homologous sequences were then obtained to assess how well solvent-exposed and buried hydrophobic residues are evolutionarily conserved, and confidence scores for hydrophobic residues being buried or solvent-exposed were assigned based on the conservation score and knowledge of the flanking regions of the hydrophobic residues. In the absence of a three-dimensional structure, the ability to predict the surface accessibility of hydrophobic residues directly from the sequence is of great help in choosing sites for chemical modification or specific mutations, and in studies of protein stability and molecular interactions.


2000, Vol. 33 (1), pp. 176–183
Author(s): Guoguang Lu

In order to facilitate the three-dimensional structure comparison of proteins, software for making comparisons and searching for similarities to protein structures in databases has been developed. The program identifies the residues that share similar positions of both main-chain and side-chain atoms between two proteins. The unique functions of the software also include database processing via Internet- and Web-based servers for different types of users. The developed method and its friendly user interface cope with many of the problems that frequently occur in protein structure comparisons, such as detecting structurally equivalent residues, misalignment caused by coincident matches of Cα atoms, circular sequence permutations, tedious repetition of access, maintenance of the most recent database, and inconvenience of the user interface. The program is also designed to cooperate with other tools in structural bioinformatics, such as the 3DB Browser software [Prilusky (1998). Protein Data Bank Q. Newslett. 84, 3–4] and the SCOP database [Murzin, Brenner, Hubbard & Chothia (1995). J. Mol. Biol. 247, 536–540], for convenient molecular modelling and protein structure analysis. A similarity ranking score of `structure diversity' is proposed in order to estimate the evolutionary distance between proteins based on comparisons of their three-dimensional structures. The function of the program has been utilized as part of an automated program for multiple protein structure alignment. In this paper, the algorithm of the program and the results of systematic tests are presented and discussed.
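One elementary building block of any such comparison is scoring residues already identified as structurally equivalent, typically via the root-mean-square deviation (RMSD) of their paired atoms. A minimal sketch is below; the coordinates in the example are toy values, and a real comparison also needs the superposition and equivalence search that the program performs.

```python
# Sketch of the RMSD between paired 3D coordinates, a basic score used when
# comparing residues judged structurally equivalent. Toy illustration only;
# superposition and equivalence detection are separate, harder steps.

import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length coordinate lists."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))
```

A structure compared against itself gives an RMSD of zero, and a rigid 1 Å shift of every atom gives exactly 1.0, which makes the measure easy to sanity-check.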


2016, Vol. 72 (2), pp. 182–191
Author(s): Jason Nicholas Busby, J. Shaun Lott, Santosh Panjikar

The B and C proteins from the ABC toxin complex of Yersinia entomophaga form a large heterodimer that cleaves and encapsulates the C-terminal toxin domain of the C protein. Determining the structure of the complex formed by B and the N-terminal region of C was challenging owing to its large size, the non-isomorphism of different crystals and their sensitivity to radiation damage. A native data set was collected to 2.5 Å resolution, and a non-isomorphous Ta6Br12-derivative data set was collected that showed a strong anomalous signal at low resolution. The tantalum-cluster sites could be found, but the anomalous signal did not extend to a high enough resolution to allow model building. Selenomethionine (SeMet)-derivatized protein crystals were produced, but the high number (60) of SeMet sites and the sensitivity of the crystals to radiation damage made phasing using the SAD or MAD methods difficult. Multiple SeMet data sets were combined to provide 30-fold multiplicity, and the low-resolution phase information from the Ta6Br12 data set was transferred to this combined data set by cross-crystal averaging. This allowed the Se atoms to be located in an anomalous difference Fourier map; they were then used in Auto-Rickshaw for multiple rounds of autobuilding and MRSAD.


2020, Vol. 76 (9), pp. 814–823
Author(s): Emad Alharbi, Radu Calinescu, Kevin Cowtan

For the last two decades, researchers have worked independently to automate protein model building, and four widely used software pipelines have been developed for this purpose: ARP/wARP, Buccaneer, Phenix AutoBuild and SHELXE. Here, the usefulness of combining these pipelines by running them in pairwise combinations to improve the built protein structures is examined. The results show that integrating these pipelines can lead to significant improvements in structure completeness and Rfree. In particular, running Phenix AutoBuild after Buccaneer improved structure completeness for 29% and 75% of the data sets that were examined at the original resolution and at a simulated lower resolution, respectively, compared with running Phenix AutoBuild on its own. In contrast, Phenix AutoBuild alone produced better structure completeness than the two pipelines combined for only 7% and 3% of these data sets.

