Training data composition affects performance of protein structure analysis algorithms

The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.
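
The comparison described in this abstract (training on all three structure types versus X-ray data only, then scoring each experimental method separately) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the record layout, the function names `build_training_sets` and `mean_score_by_method`, and the score values are all hypothetical placeholders.

```python
from collections import defaultdict

# Experimental methods named in the abstract.
METHODS = ("X-ray", "NMR", "cryo-EM")


def build_training_sets(records):
    """Return the two training-set compositions compared in the abstract."""
    mixed = [r for r in records if r["method"] in METHODS]
    xray_only = [r for r in mixed if r["method"] == "X-ray"]
    return {"all_methods": mixed, "xray_only": xray_only}


def mean_score_by_method(records, scores):
    """Average a per-structure score within each experimental method."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["method"]].append(scores[r["id"]])
    return {m: sum(v) / len(v) for m, v in grouped.items()}


# Hypothetical records; the IDs and scores are placeholders, not results.
records = [
    {"id": "s1", "method": "X-ray"},
    {"id": "s2", "method": "NMR"},
    {"id": "s3", "method": "cryo-EM"},
    {"id": "s4", "method": "X-ray"},
]
sets = build_training_sets(records)
scores = {"s1": 0.9, "s2": 0.6, "s3": 0.7, "s4": 0.8}
per_method = mean_score_by_method(records, scores)
```

Reporting the score per method, rather than one pooled number, is what makes a gap between X-ray and NMR or cryo-EM test structures visible.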

Solving a new R2lox protein structure by microcrystal electron diffraction

Science Advances ◽

10.1126/sciadv.aax4621 ◽

2019 ◽

Vol 5 (8) ◽

pp. eaax4621 ◽

Cited By ~ 17

Author(s):

Hongyi Xu ◽

Hugo Lebrette ◽

Max T. B. Clabbers ◽

Jingjing Zhao ◽

Julia J. Griese ◽

...

Keyword(s):

Protein Structure ◽

Electron Diffraction ◽

Model Building ◽

Protein Structures ◽

X Ray Diffraction ◽

X Ray ◽

X Ray Crystallography ◽

Potential Map ◽

Metal Cofactor ◽

And Function

Microcrystal electron diffraction (MicroED) has recently shown potential for structural biology. It enables the study of biomolecules from micrometer-sized 3D crystals that are too small to be studied by conventional x-ray crystallography. However, to date, MicroED has only been applied to redetermine protein structures that had already been solved previously by x-ray diffraction. Here, we present the first new protein structure—an R2lox enzyme—solved using MicroED. The structure was phased by molecular replacement using a search model of 35% sequence identity. The resulting electrostatic scattering potential map at 3.0-Å resolution was of sufficient quality to allow accurate model building and refinement. The dinuclear metal cofactor could be located in the map and was modeled as a heterodinuclear Mn/Fe center based on previous studies. Our results demonstrate that MicroED has the potential to become a widely applicable tool for revealing novel insights into protein structure and function.

Solving the first novel protein structure by 3D micro-crystal electron diffraction

10.1101/600387 ◽

2019 ◽

Cited By ~ 1

Author(s):

H. Xu ◽

H. Lebrette ◽

M.T.B. Clabbers ◽

J. Zhao ◽

J.J. Griese ◽

...

Keyword(s):

Protein Structure ◽

Electron Diffraction ◽

Model Building ◽

Protein Structures ◽

X Ray ◽

X Ray Crystallography ◽

Unknown Protein ◽

Potential Map ◽

Crystal Electron ◽

And Function

Micro-crystal electron diffraction (MicroED) has recently shown potential for structural biology. It enables studying biomolecules from micron-sized 3D crystals that are too small to be studied by conventional X-ray crystallography. However, to the best of our knowledge, MicroED has only been applied to re-determine protein structures that had already been solved previously by X-ray diffraction. Here we present the first unknown protein structure – an R2lox enzyme – solved using MicroED. The structure was phased by molecular replacement using a search model of 35% sequence identity. The resulting electrostatic scattering potential map at 3.0 Å resolution was of sufficient quality to allow accurate model building and refinement. Our results demonstrate that MicroED has the potential to become a widely applicable tool for revealing novel insights into protein structure and function, opening up new opportunities for structural biologists.

Protein Structure Analysis and Validation with X-Ray Crystallography

Methods in Molecular Biology - Protein Downstream Processing ◽

10.1007/978-1-0716-0775-6_25 ◽

2020 ◽

pp. 377-404

Author(s):

Anastassios C. Papageorgiou ◽

Nirmal Poudel ◽

Jesse Mattsson

Keyword(s):

Protein Structure ◽

Structure Analysis ◽

Protein Structure Analysis ◽

X Ray ◽

X Ray Crystallography

PROBABILISTIC ENSEMBLES FOR IMPROVED INFERENCE IN PROTEIN-STRUCTURE DETERMINATION

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720012400094 ◽

2012 ◽

Vol 10 (01) ◽

pp. 1240009 ◽

Cited By ~ 2

Author(s):

AMEET SONI ◽

JUDE SHAVLIK

Keyword(s):

Protein Structure ◽

Structure Determination ◽

Protein Structures ◽

Three Dimensional ◽

Protein Structure Determination ◽

Complex Problem ◽

Approximate Inference ◽

X Ray ◽

X Ray Crystallography ◽

Inference Methods

Protein X-ray crystallography — the most popular method for determining protein structures — remains a laborious process requiring a great deal of manual crystallographer effort to interpret low-quality protein images. Automating this process is critical in creating a high-throughput protein-structure determination pipeline. Previously, our group developed ACMI, a probabilistic framework for producing protein-structure models from electron-density maps produced via X-ray crystallography. ACMI uses a Markov Random Field to model the three-dimensional (3D) location of each non-hydrogen atom in a protein. Calculating the best structure in this model is intractable, so ACMI uses approximate inference methods to estimate the optimal structure. While previous results have shown ACMI to be the state-of-the-art method on this task, its approximate inference algorithm remains computationally expensive and susceptible to errors. In this work, we develop Probabilistic Ensembles in ACMI (PEA), a framework for leveraging multiple, independent runs of approximate inference to produce estimates of protein structures. Our results show statistically significant improvements in the accuracy of inference resulting in more complete and accurate protein structures. In addition, PEA provides a general framework for advanced approximate inference methods in complex problem domains.
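
The idea of combining multiple independent runs of approximate inference can be illustrated with a small sketch. The abstract does not say how PEA aggregates its runs, so the averaging rule below is purely an assumption chosen for simplicity, and `average_marginals` is a hypothetical name rather than part of ACMI or PEA.

```python
def average_marginals(runs):
    """Average per-variable discrete marginals over independent runs.

    `runs` is a list of runs; each run is a list of marginals (one per
    variable); each marginal is a list of state probabilities.
    Assumption: plain averaging followed by renormalisation.
    """
    n_runs = len(runs)
    combined = []
    for v in range(len(runs[0])):
        n_states = len(runs[0][v])
        avg = [sum(run[v][s] for run in runs) / n_runs for s in range(n_states)]
        total = sum(avg)
        combined.append([p / total for p in avg])
    return combined


# Two hypothetical runs over two variables with two states each.
runs = [
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.5, 0.5], [0.4, 0.6]],
]
combined = average_marginals(runs)
```

Averaging tends to damp the errors of any single run, which is the general motivation for ensembles; the combination scheme actually used in the paper may differ.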

Faculty Opinions recommendation of Comparisons of NMR spectral quality and success in crystallization demonstrate that NMR and X-ray crystallography are complementary methods for small protein structure determination.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1029453.344422 ◽

2005 ◽

Author(s):

Deyou Zheng

Keyword(s):

Protein Structure ◽

Structure Determination ◽

Protein Structure Determination ◽

Spectral Quality ◽

Small Protein ◽

X Ray ◽

Complementary Methods ◽

X Ray Crystallography

Beyond History: The List of The Most Well Studied Human Protein Structures

10.20944/preprints202008.0655.v1 ◽

2020 ◽

Author(s):

Zhenlu Li ◽

Matthias Buck

Keyword(s):

Protein Structures ◽

Protein Sequences ◽

Human Protein ◽

Current Status ◽

Protein Database ◽

X Ray ◽

X Ray Crystallography ◽

Protein Biophysics ◽

The Relationship ◽

Past Trend

Of 20,000 or so canonical human protein sequences, as of July 2020, 6,747 proteins have had their full or partial medium- to high-resolution structures determined by x-ray crystallography or other methods. Which of these proteins dominate the protein database (the PDB) and why? In this paper, we list the 272 top protein structures based on the number of their PDB depositions. This set of proteins accounts for more than 40% of all available human PDB entries and represents past trends and the current status of protein science. We briefly discuss the relationship that some of the prominent protein structures have with protein biophysics research and mention their relevance to human diseases. The information may inspire researchers who are new to protein science, and it also provides a year-2020 snapshot of the state of protein science.

MRPC (Missing Regions in Polypeptide Chains): a knowledgebase

Journal of Applied Crystallography ◽

10.1107/s1600576719012330 ◽

2019 ◽

Vol 52 (6) ◽

pp. 1422-1426

Author(s):

Rajendran Santhosh ◽

Namrata Bankoti ◽

Adgonda Malgonnavar Padmashri ◽

Daliah Michael ◽

Jeyaraman Jeyakanthan ◽

...

Keyword(s):

Protein Structures ◽

Three Dimensional ◽

Protein Molecule ◽

Data Bank ◽

Protein Crystal ◽

Dimensional Structure ◽

Protein Structure Analysis ◽

Three Dimensional Structure ◽

X Ray Crystallography ◽

Polypeptide Chains

Missing regions in protein crystal structures are those regions that cannot be resolved, mainly owing to poor electron density (if the three-dimensional structure was solved using X-ray crystallography). These missing regions are known to have high B factors and could represent loops with a possibility of being part of an active site of the protein molecule. Thus, they are likely to provide valuable information and play a crucial role in the design of inhibitors and drugs and in protein structure analysis. In view of this, an online database, Missing Regions in Polypeptide Chains (MRPC), has been developed which provides information about the missing regions in protein structures available in the Protein Data Bank. In addition, the new database has an option for users to obtain the above data for non-homologous protein structures (25 and 90%). A user-friendly graphical interface with various options has been incorporated, with a provision to view the three-dimensional structure of the protein along with the missing regions using JSmol. The MRPC database is updated regularly (currently once every three months) and can be accessed freely at the URL http://cluster.physics.iisc.ac.in/mrpc.
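
A missing region can be thought of as a run of residue numbers that is expected in the chain but absent from the modelled coordinates. The sketch below shows one simple way to find such runs. It is not the MRPC implementation, which the abstract does not describe; `find_missing_regions` is a hypothetical helper, and it assumes plain integer residue numbering with no insertion codes.

```python
def find_missing_regions(observed, first, last):
    """Return (start, end) residue-number ranges absent from `observed`.

    `first` and `last` bound the expected numbering of the chain.
    """
    present = set(observed)
    gaps = []
    start = None
    for resnum in range(first, last + 1):
        if resnum not in present:
            if start is None:
                start = resnum  # a new gap opens here
        elif start is not None:
            gaps.append((start, resnum - 1))  # the gap closed just before
            start = None
    if start is not None:
        gaps.append((start, last))  # gap runs to the chain terminus
    return gaps


# Hypothetical chain numbered 1..14 with three unmodelled stretches.
observed = [1, 2, 3, 7, 8, 12]
gaps = find_missing_regions(observed, 1, 14)
```

Real entries would need the expected range taken from the deposited sequence rather than supplied by hand.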

Effect of Reconstruction Algorithm on the Identification of 3D Printing Polymers Based on Hyperspectral CT Technology Combined with Artificial Neural Network

Materials ◽

10.3390/ma13081963 ◽

2020 ◽

Vol 13 (8) ◽

pp. 1963 ◽

Cited By ~ 2

Author(s):

Zheng Fang ◽

Renbin Wang ◽

Mengyi Wang ◽

Shuo Zhong ◽

Liquan Ding ◽

...

Keyword(s):

Neural Network ◽

Artificial Neural Network ◽

3D Printing ◽

Principal Components ◽

Reconstruction Technique ◽

Ct Reconstruction ◽

Training Set ◽

X Ray ◽

Test Sets ◽

Artificial Neural

Hyperspectral X-ray CT (HXCT) technology provides not only structural imaging but also information about the material components in the imaged object. The main purpose of this study is to investigate the effect of various reconstruction algorithms on the reconstructed X-ray absorption spectra (XAS) of components shown in the CT image by means of HXCT. In this paper, taking 3D printing polymers as an example, seven commonly used polymers, namely thermoplastic elastomer (TPE), carbon fiber reinforced polyamide (PA-CF), acrylonitrile butadiene styrene (ABS), polylactic acid (PLA), ultraviolet photosensitive resin (UV9400), polyethylene terephthalate glycol (PETG), and polyvinyl alcohol (PVA), were selected as samples for hyperspectral CT reconstruction experiments. The seven 3D printing polymers and two interfering samples were divided into a training set and test sets. First, structural images of the specimens were reconstructed by Filtered Back-Projection (FBP), Algebraic Reconstruction Technique (ART), and Maximum-Likelihood Expectation-Maximization (ML-EM). Second, reconstructed XAS were extracted from the pixels of the regions of interest (ROI) delineated in the images. Third, principal component analysis (PCA) showed that the first four principal components contain the main features of the reconstructed XAS, so we trained an Artificial Neural Network (ANN) on the training-set XAS expressed in these four components and used it to identify the XAS of the corresponding polymers in both test sets. The ANN results show that FBP gives the best classification performance, with a ten-fold cross-validation accuracy of 99%. This suggests that hyperspectral CT reconstruction is a promising way of obtaining image features and material features at the same time, which can be used in medical imaging and nondestructive testing.
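
The ten-fold cross-validation accuracy quoted above is the mean accuracy over ten held-out folds. The sketch below shows that bookkeeping only. It is an illustration under assumptions: the function names are hypothetical, `evaluate` stands in for training and scoring whatever classifier is used, and the folds are contiguous, whereas in practice the samples would normally be shuffled first.

```python
def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_samples, evaluate, k=10):
    """Mean of evaluate(train_idx, test_idx) over the k folds."""
    accuracies = []
    for held_out in k_fold_indices(n_samples, k):
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        accuracies.append(evaluate(train, held_out))
    return sum(accuracies) / k


folds = k_fold_indices(25, k=10)
# Dummy evaluator that always reports perfect accuracy, for illustration.
mean_accuracy = cross_validate(25, lambda train, test: 1.0)
```

Each sample is held out exactly once, so the reported figure reflects performance on data the classifier did not see during that fold's training.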

MTD-PLS and docking study for a series of substituted 2-phenylindole derivatives with oestrogenic activity

Chemical Papers ◽

10.2478/s11696-011-0040-3 ◽

2011 ◽

Vol 65 (4) ◽

Author(s):

Edward Seclaman ◽

Alina Bora ◽

Sorin Avram ◽

Zeno Simon ◽

Ludovic Kurunczi

Keyword(s):

Oestrogen Receptor ◽

Scoring Function ◽

Docking Study ◽

Scoring Functions ◽

X Ray Diffraction ◽

X Ray ◽

X Ray Crystallography ◽

Receptor Complexes ◽

Test Sets ◽

Latent Structures

A series of 36 substituted 2-phenylindoles was analysed using minimal topological difference-projections in latent structures variant (MTD-PLS) and molecular docking, using fast rigid exhaustive docking (FRED) and AutoDock Vina programs. For quantitative structure activity relationships (QSAR) validation, a sphere exclusion algorithm in the multi-dimensional descriptor space was used to construct training and test sets. Docking procedures were based on X-ray crystallography studies using the human alpha oestrogen receptor-17β-oestradiol complex. The ranking abilities of the different scoring functions of the FRED package were presented, and the most suitable scoring function (Chemgauss3) for the oestrogen receptor was chosen. Although the series studied contains only a limited number of compounds, the MTD-PLS method and the docking procedure provided coherent results in concordance with the X-ray diffraction data for different ligand-oestrogen receptor complexes.
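
A sphere exclusion split picks a compound, treats everything within a fixed radius of it in descriptor space as covered, and repeats on what is left. The sketch below is one common greedy variant, not necessarily the one used in the paper. The abstract gives neither the radius nor the rule for assigning compounds to sets, so the Euclidean metric, the selection order, and the mapping of picked compounds to one set and excluded compounds to the other are all assumptions here, and `sphere_exclusion` is a hypothetical name.

```python
import math


def sphere_exclusion(points, radius):
    """Greedy sphere exclusion over descriptor vectors.

    Returns (selected, excluded) index lists: each selected point is a
    sphere centre, and excluded points lie within `radius` of a centre.
    """
    remaining = list(range(len(points)))
    selected, excluded = [], []
    while remaining:
        center = remaining.pop(0)  # assumption: take the next point in order
        selected.append(center)
        keep = []
        for i in remaining:
            if math.dist(points[center], points[i]) <= radius:
                excluded.append(i)
            else:
                keep.append(i)
        remaining = keep
    return selected, excluded


# Hypothetical 2D descriptors; real QSAR descriptors have many dimensions.
points = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.1), (10.0, 10.0)]
selected, excluded = sphere_exclusion(points, radius=1.0)
```

Because every excluded compound lies close to a selected one, the two sets cover similar regions of descriptor space, which is the usual reason for choosing this kind of split over a random one.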
