Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features

Abstract X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity helps guide experimental design and improve the success rate of X-ray crystallography experiments. This study developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on this pipeline, two new protein crystallization propensity predictors, denoted DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that estimates the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that estimates the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is designed specifically for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of the other state-of-the-art protein crystallization propensity predictors.
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by combining sequence-order information with the solvent accessibility of residues. In addition, the newly constructed crystallization datasets help train the models with more comprehensive crystallization knowledge.
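The reported gains of 199.7% and 77.8% are relative increases in the Matthews correlation coefficient, not absolute differences. The short sketch below shows how such a percentage is usually computed. The function name and the example values 0.25 and 0.75 are hypothetical, chosen only for illustration; the abstract does not state the underlying coefficient values.

```python
def relative_increase_percent(baseline: float, improved: float) -> float:
    """Relative increase of `improved` over `baseline`, in percent.

    Assumes a positive baseline; the abstract does not give the raw
    coefficient values, so any numbers passed in here are illustrative.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive for a relative increase")
    return (improved - baseline) / baseline * 100.0


# Hypothetical values, NOT taken from the paper: a score rising
# from 0.25 to 0.75 is a 200% relative increase (a threefold value).
example = relative_increase_percent(0.25, 0.75)
print(f"{example:.1f}%")
```

Read this way, a 199.7% increase means the new value is roughly 2.997 times the best competing value, and a 77.8% increase means roughly 1.778 times.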

BCrystal: an interpretable sequence-based protein crystallization predictor

Bioinformatics ◽

10.1093/bioinformatics/btz762 ◽

2019 ◽

Vol 36 (5) ◽

pp. 1429-1438 ◽

Cited By ~ 6

Author(s):

Abdurrahman Elbasir ◽

Raghvendra Mall ◽

Khalid Kunji ◽

Reda Rawi ◽

Zeyaul Islam ◽

...

Keyword(s):

Correlation Coefficient ◽

Solvent Accessibility ◽

Protein Crystallization ◽

Protein Structures ◽

Attrition Rate ◽

Supplementary Information ◽

Gradient Boosting ◽

Relative Solvent Accessibility ◽

X Ray Crystallography ◽

Matthews Correlation Coefficient

Abstract Motivation: X-ray crystallography has facilitated the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would help to overcome the high expenditure and large attrition rate, and to reduce the trial-and-error settings required for crystallization. Results: In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physicochemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. We observed that higher relative solvent accessibility of exposed residues associates positively with protein crystallizability, whereas the number of disordered regions, the fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability. Availability and implementation: Our BCrystal webserver is at https://machinelearning-protein.qcri.org/ and source code is available at https://github.com/raghvendra5688/BCrystal. Supplementary information: Supplementary data are available at Bioinformatics online.
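The three metrics quoted above (accuracy, recall and the Matthews correlation coefficient) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch using the standard definitions, not the BCrystal evaluation code; the function names and the toy labels are assumptions made for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, 1 = crystallizable."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Fraction of true positives that were recovered."""
    return tp / (tp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy labels, NOT data from the paper.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn), recall(tp, fn), mcc(tp, tn, fp, fn))
```

Unlike accuracy, the Matthews correlation coefficient uses all four counts symmetrically, which is why it is often preferred when the crystallizable and non-crystallizable classes are imbalanced.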

Magnetic Particles used in a New Approach for Designed Protein Crystallization

CrystEngComm ◽

10.1039/d0ce01529f ◽

2021 ◽

Author(s):

Raquel dos Santos ◽

Maria João Romão ◽

Ana C A Roque ◽

Ana Luisa Moreira Carvalho

Keyword(s):

Structure Determination ◽

Magnetic Particles ◽

Protein Crystallization ◽

Protein Structures ◽

3D Structure ◽

New Approach ◽

X Ray ◽

X Ray Crystallography ◽

3D Structure Determination

After more than one hundred and thirty thousand protein structures determined by X-ray crystallography, the challenge of protein crystallization for 3D structure determination remains. In the quest for additives for...

Beyond History: The List of The Most Well Studied Human Protein Structures

10.20944/preprints202008.0655.v1 ◽

2020 ◽

Author(s):

Zhenlu Li ◽

Matthias Buck

Keyword(s):

Protein Structures ◽

Protein Sequences ◽

Human Protein ◽

Current Status ◽

Protein Database ◽

X Ray ◽

X Ray Crystallography ◽

Protein Biophysics ◽

The Relationship ◽

Past Trend

Of 20,000 or so canonical human protein sequences, as of July 2020, 6,747 proteins have had their full or partial medium- to high-resolution structures determined by X-ray crystallography or other methods. Which of these proteins dominate the protein database (the PDB) and why? In this paper, we list the 272 top protein structures based on the number of their PDB depositions. This set of proteins accounts for more than 40% of all available human PDB entries and represents past trends and the current status of protein science. We briefly discuss the relationship that some of the prominent protein structures have with protein biophysics research and mention their relevance to human diseases. The information may inspire researchers who are new to protein science, but it also provides a year 2020 snapshot of the state of protein science.

Role of Computational Methods in Going beyond X-ray Crystallography to Explore Protein Structure and Dynamics

International Journal of Molecular Sciences ◽

10.3390/ijms19113401 ◽

2018 ◽

Vol 19 (11) ◽

pp. 3401 ◽

Cited By ~ 16

Author(s):

Ashutosh Srivastava ◽

Tetsuro Nagai ◽

Arpita Srivastava ◽

Osamu Miyashita ◽

Florence Tama

Keyword(s):

Protein Dynamics ◽

Computational Methods ◽

Protein Structures ◽

Three Dimensional ◽

Dimensional Structure ◽

X Ray ◽

X Ray Crystallography ◽

Insight Into

Protein structural biology has come a long way since the determination of the first three-dimensional structure of myoglobin about six decades ago. Across this period, X-ray crystallography was the most important experimental method for gaining atomic-resolution insight into protein structures. However, as the role of dynamics gained importance in the function of proteins, the inability of X-ray crystallography to capture dynamics came to the forefront. Computational methods proved to be immensely successful in understanding protein dynamics in solution, and they continue to improve in terms of both the scale and the types of systems that can be studied. In this review, we briefly discuss the limitations of X-ray crystallography in studying protein dynamics, and then provide an overview of different computational methods that are instrumental in understanding the dynamics of proteins and biomacromolecular complexes.

Structure and function of cement proteins in human adenovirus

Acta Crystallographica Section A Foundations and Advances ◽

10.1107/s205327331408396x ◽

2014 ◽

Vol 70 (a1) ◽

pp. C1603-C1603

Author(s):

Vijay Reddy ◽

Glen Nemerow

Keyword(s):

Protein Structures ◽

Human Adenovirus ◽

Icosahedral Symmetry ◽

Virion Assembly ◽

X Ray ◽

X Ray Crystallography ◽

Capsid Shell ◽

Multiple Copies ◽

And Function ◽

Cement Protein

Human adenoviruses (HAdVs) are large (~150 nm in diameter, 150 MDa) nonenveloped double-stranded DNA (dsDNA) viruses that cause respiratory, ocular, and enteric diseases. The capsid shell of adenovirus (Ad) comprises multiple copies of three major capsid proteins (MCPs: hexon, penton base and fiber) and four minor/cement proteins (IIIa, VI, VIII and IX) that are organized with pseudo T=25 icosahedral symmetry. In addition, six other proteins (V, VII, μ, IVa2, terminal protein and protease) are encapsidated along with the 36 kb dsDNA genome inside the capsid. The crystal structures of all three MCPs are known, as is their organization in the capsid, from prior X-ray crystallography and cryoEM analyses. However, the structures and locations of the various cement proteins are a matter of considerable debate. We have determined and refined the structure of an entire human adenovirus employing X-ray crystallographic methods at 3.8 Å resolution. Adenovirus cement proteins play crucial roles in virion assembly, disassembly, cell entry and infection. Based on the refined crystal structure of adenovirus, we have determined the structure of the cement protein VI, a key membrane-lytic molecule, and its associations with proteins V and VIII, which together glue peripentonal hexons beneath the vertex region and connect them to the rest of the capsid. Following virion maturation, the cleaved N-terminal pro-peptide of VI is observed deep in the peripentonal hexon cavity, detached from the membrane-lytic domain. Furthermore, we have significantly revised the recent cryoEM models for proteins IIIa and IX, both of which are located on the capsid exterior. Together, the cement proteins exclusively stabilize the hexon shell, thus rendering penton vertices the weakest links of the adenovirus capsid. Adenovirus cement protein structures reveal the molecular basis of the maturation cleavage of VI that is needed for endosome rupture and delivery of the virion into the cytoplasm.

X-RAY CRYSTALLOGRAPHY: Opening the Door to More Membrane Protein Structures

Science ◽

10.1126/science.277.5332.1607 ◽

1997 ◽

Vol 277 (5332) ◽

pp. 1607-1608 ◽

Cited By ~ 3

Author(s):

A. S. Moffat

Keyword(s):

Membrane Protein ◽

Protein Structures ◽

X Ray ◽

X Ray Crystallography

Bence Jones KWR protein structures determined by X-ray crystallography

Acta Crystallographica Section D Biological Crystallography ◽

10.1107/s0907444907021981 ◽

2007 ◽

Vol 63 (7) ◽

pp. 780-792 ◽

Cited By ~ 7

Author(s):

Debora L. Makino ◽

Agnes H. Henschen-Edman ◽

Steven B. Larson ◽

Alexander McPherson

Keyword(s):

Protein Structures ◽

X Ray ◽

X Ray Crystallography

xMDFF: molecular dynamics flexible fitting of low-resolution X-ray structures

Acta Crystallographica Section D Biological Crystallography ◽

10.1107/s1399004714013856 ◽

2014 ◽

Vol 70 (9) ◽

pp. 2344-2355 ◽

Cited By ~ 33

Author(s):

Ryan McGreevy ◽

Abhishek Singharoy ◽

Qufei Li ◽

Jingfen Zhang ◽

Dong Xu ◽

...

Keyword(s):

Molecular Dynamics ◽

Large Scale ◽

Protein Structures ◽

Data Bank ◽

Real Space ◽

Low Resolution ◽

X Ray ◽

X Ray Crystallography ◽

Atomic Structures ◽

Electron Density Map

X-ray crystallography remains the most dominant method for solving atomic structures. However, for relatively large systems, the availability of only medium-to-low-resolution diffraction data often limits the determination of all-atom details. A new molecular dynamics flexible fitting (MDFF)-based approach, xMDFF, for determining structures from such low-resolution crystallographic data is reported. xMDFF employs a real-space refinement scheme that flexibly fits atomic models into an iteratively updating electron-density map. It addresses significant large-scale deformations of the initial model to fit the low-resolution density, as tested with synthetic low-resolution maps of D-ribose-binding protein. xMDFF has been successfully applied to re-refine six low-resolution protein structures of varying sizes that had already been submitted to the Protein Data Bank. Finally, via systematic refinement of a series of data from 3.6 to 7 Å resolution, xMDFF refinements together with electrophysiology experiments were used to validate the first all-atom structure of the voltage-sensing protein Ci-VSP.

Solving a new R2lox protein structure by microcrystal electron diffraction

Science Advances ◽

10.1126/sciadv.aax4621 ◽

2019 ◽

Vol 5 (8) ◽

pp. eaax4621 ◽

Cited By ~ 17

Author(s):

Hongyi Xu ◽

Hugo Lebrette ◽

Max T. B. Clabbers ◽

Jingjing Zhao ◽

Julia J. Griese ◽

...

Keyword(s):

Protein Structure ◽

Electron Diffraction ◽

Model Building ◽

Protein Structures ◽

X Ray Diffraction ◽

X Ray ◽

X Ray Crystallography ◽

Potential Map ◽

Metal Cofactor ◽

And Function

Microcrystal electron diffraction (MicroED) has recently shown potential for structural biology. It enables the study of biomolecules from micrometer-sized 3D crystals that are too small to be studied by conventional x-ray crystallography. However, to date, MicroED has only been applied to redetermine protein structures that had already been solved previously by x-ray diffraction. Here, we present the first new protein structure—an R2lox enzyme—solved using MicroED. The structure was phased by molecular replacement using a search model of 35% sequence identity. The resulting electrostatic scattering potential map at 3.0-Å resolution was of sufficient quality to allow accurate model building and refinement. The dinuclear metal cofactor could be located in the map and was modeled as a heterodinuclear Mn/Fe center based on previous studies. Our results demonstrate that MicroED has the potential to become a widely applicable tool for revealing novel insights into protein structure and function.

Faculty Opinions recommendation of Heterogeneity and inaccuracy in protein structures solved by X-ray crystallography.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.1018989.215540 ◽

2004 ◽

Author(s):

Matthew Jacobson

Keyword(s):

Protein Structures ◽

X Ray ◽

X Ray Crystallography
