BCrystal: an interpretable sequence-based protein crystallization predictor

2019
Vol 36 (5)
pp. 1429-1438
Author(s):
Abdurrahman Elbasir
Raghvendra Mall
Khalid Kunji
Reda Rawi
Zeyaul Islam
...  

Abstract

Motivation: X-ray crystallography has facilitated the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would be highly beneficial for overcoming the high expenditure and large attrition rate, and for reducing the trial-and-error settings required for crystallization.

Results: In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physio-chemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. We observed that higher relative solvent accessibility of exposed residues associates positively with protein crystallizability, whereas the number of disordered regions, the fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability.

Availability and implementation: Our BCrystal webserver is at https://machinelearning-protein.qcri.org/ and source code is available at https://github.com/raghvendra5688/BCrystal.

Supplementary information: Supplementary data are available at Bioinformatics online.
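The accuracy, recall and Matthews correlation coefficient quoted in the abstract above are all functions of a binary confusion matrix. The following is a minimal sketch of those standard definitions, not code from BCrystal; the function names and the example counts are illustrative assumptions.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp: int, fn: int) -> float:
    """Fraction of truly crystallizable proteins that are predicted as such."""
    return tp / (tp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when a marginal total is zero and the ratio is undefined,
    which is a common convention (an assumption here, not part of the source).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to exercise the functions.
tp, tn, fp, fn = 90, 85, 15, 10
print(accuracy(tp, tn, fp, fn), recall(tp, fn), matthews_corrcoef(tp, tn, fp, fn))
```

Unlike accuracy, the coefficient uses all four cells of the confusion matrix, so it stays informative when crystallizable and non-crystallizable proteins are unevenly represented.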

Author(s):  
Yi-Heng Zhu
Jun Hu
Fang Ge
Fuyi Li
Jiangning Song
...  

Abstract

X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors.
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.


2020
Author(s):
Azadeh Alavi
David B. Ascher

Abstract

The key method for determining the structure of a protein to date is X-ray crystallography, which is a very expensive technique that suffers from a high attrition rate. In contrast, a sequence-based predictor capable of accurately determining protein crystallization property would not only overcome such limitations but also reduce the trial-and-error settings required to perform crystallization. In this work, to predict protein crystallizability, we have developed a novel sequence-based hybrid method that employs two separate, yet fully automated, concepts for extracting features from protein sequences. Specifically, we use a deep convolutional neural network on a publicly available dataset to extract descriptive features directly from the sequences, then fuse these features with structural and physio-chemical features (such as amino-acid composition or AAIndex-based physicochemical properties). Dimensionality reduction is then performed on the resulting features, and the output vectors are used to train an optimized gradient boosting machine (XGBoost). We evaluate our method on three publicly available test sets and show that our proposed DHS-Crystallize algorithm outperforms state-of-the-art methods and achieves higher performance than using either DCNN-derived features or structural and physio-chemical features alone.
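Amino-acid composition, named in the abstract above as one of the hand-crafted features, is simply the relative frequency of each of the 20 standard residues in a sequence. A minimal sketch follows; the function name, the handling of non-standard characters and the example sequence are assumptions made for illustration and are not taken from DHS-Crystallize.

```python
from collections import Counter
from typing import Dict

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in the sequence.

    Characters outside the 20 standard letters (e.g. 'X') are ignored,
    which is an assumption of this sketch.
    """
    counts = Counter(res for res in sequence.upper() if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy sequence, for illustration only.
composition = amino_acid_composition("MKHHHAAG")
print(composition["H"])
```

The result is a fixed-length vector of 20 values regardless of sequence length, which is what allows it to be concatenated with other feature sets before dimensionality reduction.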


2019
Author(s):  
Raghvendra Mall

Abstract

Motivation: Protein solubility is a property associated with protein expression and is a critical determinant of the manufacturability of therapeutic proteins. It is thus imperative to design accurate in-silico sequence-based solubility predictors.

Methods: In this study, we propose SolXplain, an extreme gradient boosting machine based protein solubility predictor which achieves state-of-the-art performance using physio-chemical, sequence and novel structure derived features from protein sequences. Moreover, SolXplain has the unique attribute that it can provide an explanation for the predicted class label of each test protein, based on its corresponding feature values, using the SHapley Additive exPlanations (SHAP) method.

Results: Based on an independent test set, SolXplain outperformed other sequence-based methods by at least 2% in accuracy and 2% in Matthews correlation coefficient, with an overall accuracy of 78% and Matthews correlation coefficient of 0.56. Additionally, for fractions of exposed residues (FER) at various residual solvent accessibility (RSA) cutoffs, we observed higher fractions to associate positively with protein solubility, and tripeptide stretches that contain one isoleucine and one or more histidines to associate negatively with solubility. The improved prediction accuracy of SolXplain enables it to predict protein solubility with greater consistency and screen for sequences with enhanced manufacturability.
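The fraction of exposed residues (FER) at a given RSA cutoff, mentioned in the abstract above, can be read as the share of residues whose solvent accessibility exceeds that cutoff. The sketch below assumes per-residue RSA values scaled to the range 0 to 1 and a strict "greater than" comparison; the cutoffs, the comparison rule and the example values are illustrative assumptions, since the abstract does not specify them.

```python
from typing import Dict, Sequence


def fraction_exposed(rsa_values: Sequence[float], cutoff: float) -> float:
    """Fraction of residues whose RSA is strictly above the cutoff."""
    if not rsa_values:
        raise ValueError("rsa_values must not be empty")
    exposed = sum(1 for rsa in rsa_values if rsa > cutoff)
    return exposed / len(rsa_values)


def fer_profile(rsa_values: Sequence[float],
                cutoffs: Sequence[float]) -> Dict[float, float]:
    """FER at each cutoff, giving one feature per cutoff."""
    return {cutoff: fraction_exposed(rsa_values, cutoff) for cutoff in cutoffs}


# Hypothetical predicted RSA values for a four-residue toy protein.
rsa = [0.1, 0.3, 0.5, 0.9]
print(fer_profile(rsa, cutoffs=[0.25, 0.5, 0.75]))
```

Evaluating several cutoffs yields a small profile of features per protein, each of which can then be inspected separately in a feature-attribution analysis.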


2020
Vol 36 (11)
pp. 3372-3378
Author(s):
Alexander Gress
Olga V Kalinina

Abstract

Motivation: In proteins, solvent accessibility of individual residues is a factor contributing to their importance for protein function and stability. Hence one might wish to calculate solvent accessibility in order to predict the impact of mutations and their pathogenicity, and for other biomedical applications. A direct computation of solvent accessibility is only possible if all atoms of a protein three-dimensional structure are reliably resolved.

Results: We present SphereCon, a new precise measure that can estimate residue relative solvent accessibility (RSA) from limited data. The measure is based on calculating the volume of intersection of a sphere with a cone cut out in the direction opposite of the residue with surrounding atoms. We propose a method for estimating the position and volume of residue atoms in cases when they are not known from the structure, or when the structural data are unreliable or missing. We show that in cases of reliable input structures, SphereCon correlates almost perfectly with the directly computed RSA, and outperforms other previously suggested indirect methods. Moreover, SphereCon is the only measure that yields accurate results when the identities of amino acids are unknown. A significant novel feature of SphereCon is that it can estimate RSA from inter-residue distance and contact matrices, without any information about the actual atom coordinates.

Availability and implementation: https://github.com/kalininalab/spherecon.

Supplementary information: Supplementary data are available at Bioinformatics online.


2020
Vol 36 (12)
pp. 3897-3898
Author(s):
Mirko Torrisi
Gianluca Pollastri

Abstract

Motivation: Protein structural annotations (PSAs) are essential abstractions to deal with the prediction of protein structures. Many increasingly sophisticated PSAs have been devised in the last few decades. However, the need for annotations that are easy to compute, process and predict has not diminished. This is especially true for protein structures that are hardest to predict, such as novel folds.

Results: We propose Brewery, a suite of ab initio predictors of 1D PSAs. Brewery uses multiple sources of evolutionary information to achieve state-of-the-art predictions of secondary structure, structural motifs, relative solvent accessibility and contact density.

Availability and implementation: The web server, standalone program, Docker image and training sets of Brewery are available at http://distilldeep.ucd.ie/brewery/.


2018
Vol 35 (13)
pp. 2216-2225
Author(s):
Abdurrahman Elbasir
Balasubramanian Moovarkumudalvan
Khalid Kunji
Prasanna R Kolatkar
Raghvendra Mall
...  

Abstract

Motivation: Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate and series of trial-and-error settings, many in-silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins which can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not.

Results: Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4% and 12.1% in recall when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1% and 6.0% for F-score, 1.9% and 3.9% for accuracy and 3.8% and 7.0% for MCC with respect to Crysalis II and Crysf on independent test sets.

Availability and implementation: The standalone source code and models are available at https://github.com/elbasir/DeepCrystal and a web-server is also available at https://deeplearning-protein.qcri.org.

Supplementary information: Supplementary data are available at Bioinformatics online.
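For readers unfamiliar with the term, a k-mer is a contiguous stretch of k residues in a sequence. The sketch below only illustrates what the overlapping k-mers of a protein sequence are by counting them explicitly; it is not DeepCrystal's method, which learns k-mer-like patterns through convolutional filters rather than by counting. The function name and the example sequence are hypothetical.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every overlapping k-mer (window of k consecutive residues)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n has n - k + 1 overlapping windows of size k;
    # the range is empty when the sequence is shorter than k.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical toy sequence, for illustration only.
counts = kmer_counts("MKVMKV", k=3)
print(counts.most_common())
```

A convolutional filter of width k slides over the same windows that this function enumerates, which is why such networks can pick up frequently occurring k-mers without an explicit feature table.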


CrystEngComm
2021
Author(s):
Raquel dos Santos
Maria João Romão
Ana C A Roque
Ana Luisa Moreira Carvalho

After more than one hundred and thirty thousand protein structures determined by X-ray crystallography, the challenge of protein crystallization for 3D structure determination remains. In the quest for additives for...


2014
Vol 70 (a1)
pp. C1144-C1144
Author(s):
Areej Abuhammad
Michael McDonough
Jürgen Brem
Christopher Schofield
Elspeth Garman

Protein structures have significantly impacted and aided drug discovery efforts. However, it is not enough to know the structure of a protein; it must be the right structure. Small alterations in sequence can lead to different conformations and oligomerization states, and can cause changes that alter active-site architecture and modify function. Protein crystallization is an essential prerequisite for the determination of protein structures by X-ray crystallography. We have obtained encouraging initial results for a hitherto unexplored crystallization method with the enzyme arylamine N-acetyltransferase from M. tuberculosis (TBNAT). Despite prolonged and varied trials to crystallize TBNAT, an important anti-tubercular drug target, no crystals were obtained. In an alternative approach, cross-seeding of TBNAT protein with micro-crystalline seeds from a homologous NAT from M. marinum (74% sequence identity (SID)) surprisingly resulted in a single 20-micron TBNAT crystal that diffracted to 2.1 Å and allowed for TBNAT structure determination (Abuhammad et al., 2013). To our knowledge, cross-seeding crystallization using homologous proteins has previously been successful only in cases with more than 85% SID. In this study, we have explored the effect of low sequence homology on cross-seeding using β-lactamases with SID as low as 30%. Despite the low SIDs, the results show that cross-seeding leads to an increase in hits obtained, the identification of new crystallization conditions, shortening of crystallization time and an improvement in the quality of the crystals obtained.


2014
Author(s):
Amir Shahmoradi
Dariya K. Sydykova
Stephanie J. Spielman
Eleisha L. Jackson
Eric T. Dawson
...  

Several recent works have shown that protein structure can predict site-specific evolutionary sequence variation. In particular, sites that are buried and/or have many contacts with other sites in a structure have been shown to evolve more slowly, on average, than surface sites with few contacts. Here, we present a comprehensive study of the extent to which numerous structural properties can predict sequence variation. The quantities we considered include buriedness (as measured by relative solvent accessibility), packing density (as measured by contact number), structural flexibility (as measured by B factors, root-mean-square fluctuations, and variation in dihedral angles), and variability in designed structures. We obtained structural flexibility measures both from molecular dynamics simulations performed on 9 non-homologous viral protein structures and from variation in homologous variants of those proteins, where available. We obtained measures of variability in designed structures from flexible-backbone design in the Rosetta software. We found that most of the structural properties correlate with site variation in the majority of structures, though the correlations are generally weak (correlation coefficients of 0.1 to 0.4). Moreover, we found that buriedness and packing density were better predictors of evolutionary variation than was structural flexibility. Finally, variability in designed structures was a weaker predictor of evolutionary variability than was buriedness or packing density, but it was comparable in its predictive power to the best structural flexibility measures. We conclude that simple measures of buriedness and packing density are better predictors of evolutionary variation than are more complicated predictors obtained from dynamic simulations, ensembles of homologous structures, or computational protein design.
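Contact number, used in the abstract above as the measure of packing density, counts how many other residues lie within a fixed distance of a given residue. The sketch below assumes one representative coordinate per residue (for example the C-alpha atom) and leaves the distance cutoff as a required parameter; the study's actual atom choice and cutoff are not stated here, so both are assumptions, as are the toy coordinates.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def contact_numbers(coords: Sequence[Point], cutoff: float) -> List[int]:
    """For each residue, count the other residues within `cutoff` of it."""
    counts = [0] * len(coords)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            # Each pair within the cutoff adds one contact to both residues.
            if math.dist(coords[i], coords[j]) <= cutoff:
                counts[i] += 1
                counts[j] += 1
    return counts


# Hypothetical coordinates (in angstroms) for four residues on a line.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(contact_numbers(toy_coords, cutoff=5.0))
```

Buried residues in a folded protein tend to have high contact numbers and surface residues low ones, which is why this quantity can be compared with relative solvent accessibility as a predictor of site variation.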


2020
Author(s):
Akanksha Pandey
Edward L. Braun

Abstract

Motivation: Protein sequence evolution is a complex process that varies among sites within proteins and across the tree of life. Comparisons of evolutionary rate matrices for specific taxa (‘clade-specific models’) have the potential to reveal this variation and provide information about the underlying reasons for those changes. To study changes in patterns of protein sequence evolution, we estimated and compared clade-specific models in a way that acknowledged variation within proteins due to structure.

Results: Clade-specific model fit was able to correctly classify proteins from four specific groups (vertebrates, plants, oomycetes, and yeasts) more than 70% of the time. This was true whether we used mixture models that incorporate relative solvent accessibility or simple models that treat sites as homogeneous. Thus, protein evolution is non-homogeneous over the tree of life. However, a small number of dimensions could explain the differences among models (for mixture models ~50% of the variance reflected relative solvent accessibility and ~25% reflected clade). Relaxed purifying selection in taxa with lower long-term effective population sizes appears to explain much of the among-clade variance. Relaxed selection on solvent-exposed sites was correlated with changes in amino acid side-chain volume; other differences among models were more complex. Beyond the information they reveal about protein evolution, our clade-specific models also represent tools for phylogenomic inference.

Availability: Model files are available from https://github.com/ebraun68/[email protected]

Supplementary information: Supplementary data are appended to this preprint.

