DHS-Crystallize: Deep-Hybrid-Sequence based method for predicting protein Crystallization

2020 ◽  
Author(s):  
Azadeh Alavi ◽  
David B. Ascher

Abstract The key method for determining the structure of a protein to date is X-ray crystallography, a very expensive technique that suffers from a high attrition rate. By contrast, a sequence-based predictor capable of accurately determining the crystallization propensity of a protein would not only overcome these limitations, but would also reduce the trial-and-error settings required to perform crystallization. In this work, to predict protein crystallizability, we have developed a novel sequence-based hybrid method that employs two separate, yet fully automated, concepts for extracting features from protein sequences. Specifically, we use a deep convolutional neural network (DCNN) on a publicly available dataset to extract descriptive features directly from the sequences, then fuse these features with structural and physico-chemical features (such as amino-acid composition or AAIndex-based physicochemical properties). Dimensionality reduction is then performed on the resulting features, and the output vectors are used to train an optimized gradient boosting machine (XGBoost). We evaluate our method on three publicly available test sets, and show that our proposed DHS-Crystallize algorithm outperforms state-of-the-art methods and achieves higher performance than using either DCNN-derived features or structural and physico-chemical features alone.
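The feature-fusion step described in this abstract can be illustrated with a minimal sketch. It assumes that fusion is plain concatenation and uses amino-acid composition as the only hand-crafted feature; the function names and the placeholder DCNN vector are hypothetical, and neither the dimensionality reduction nor the XGBoost stage is reproduced here.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    # Normalise by the number of standard residues so the fractions sum to 1.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def fuse_features(learned, handcrafted):
    """Fuse two feature vectors by concatenation (an assumption of this sketch)."""
    return list(learned) + list(handcrafted)


composition = amino_acid_composition("MKTAYIAKQR")
# Hypothetical stand-in for features that a trained DCNN would produce.
placeholder_dcnn = [0.12, -0.40, 0.88]
fused = fuse_features(placeholder_dcnn, composition)
print(len(fused))  # 3 learned + 20 composition values = 23
```

In a full pipeline the fused vector would be passed through dimensionality reduction before training the classifier; those stages are left out because the abstract does not give their settings.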

2019 ◽  
Vol 36 (5) ◽  
pp. 1429-1438 ◽  
Author(s):  
Abdurrahman Elbasir ◽  
Raghvendra Mall ◽  
Khalid Kunji ◽  
Reda Rawi ◽  
Zeyaul Islam ◽  
...  

Abstract Motivation: X-ray crystallography has produced the majority of protein structures determined to date. Sequence-based predictors that can accurately estimate protein crystallization propensities would help to overcome the high cost and large attrition rate, and to reduce the trial-and-error settings required for crystallization. Results: In this study, we present a novel model, BCrystal, which uses an optimized gradient boosting machine (XGBoost) on sequence, structural and physico-chemical features extracted from the proteins of interest. BCrystal also provides explanations, highlighting the most important features for the predicted crystallization propensity of an individual protein using the SHAP algorithm. On three independent test sets, BCrystal outperforms state-of-the-art sequence-based methods by more than 12.5% in accuracy, 18% in recall and 0.253 in Matthews correlation coefficient, with an average accuracy of 93.7%, recall of 96.63% and Matthews correlation coefficient of 0.868. We observed that higher relative solvent accessibility of exposed residues associates positively with protein crystallizability, whereas the number of disordered regions, the fraction of coils and tripeptide stretches that contain multiple histidines associate negatively with crystallizability. The higher accuracy of BCrystal enables it to accurately screen for sequence variants with enhanced crystallizability. Availability and implementation: Our BCrystal webserver is at https://machinelearning-protein.qcri.org/ and source code is available at https://github.com/raghvendra5688/BCrystal. Supplementary information: Supplementary data are available at Bioinformatics online.
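The three metrics quoted in this abstract (accuracy, recall and Matthews correlation coefficient) follow from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts are invented for illustration and are not BCrystal results.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, MCC) from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Recall is the fraction of truly crystallizable proteins that are recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, recall, mcc


# Illustrative counts only (hypothetical, not taken from the paper).
acc, rec, mcc = classification_metrics(tp=90, fp=10, tn=85, fn=15)
print(f"accuracy={acc:.3f} recall={rec:.3f} mcc={mcc:.3f}")
```

Unlike accuracy, MCC uses all four cells symmetrically, which is why it is commonly reported alongside accuracy on test sets with unequal class sizes.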


2011 ◽  
Vol 65 (4) ◽  
Author(s):  
Edward Seclaman ◽  
Alina Bora ◽  
Sorin Avram ◽  
Zeno Simon ◽  
Ludovic Kurunczi

Abstract A series of 36 substituted 2-phenylindoles was analysed using the minimal topological difference-projections in latent structures variant (MTD-PLS) and molecular docking, using the fast rigid exhaustive docking (FRED) and AutoDock Vina programs. For validation of the quantitative structure-activity relationships (QSAR), a sphere exclusion algorithm in the multi-dimensional descriptor space was used to construct training and test sets. Docking procedures were based on X-ray crystallography studies using the human alpha oestrogen receptor-17β-oestradiol complex. The ranking abilities of the different scoring functions of the FRED package were presented, and the most suitable scoring function (Chemgauss3) for the oestrogen receptor was chosen. Although the series studied contains only a limited number of compounds, the MTD-PLS method and the docking procedure provided coherent results in concordance with the X-ray diffraction data for different ligand-oestrogen receptor complexes.
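The sphere exclusion step mentioned in this abstract can be sketched as follows. This is one common variant, assuming Euclidean distance and a fixed radius; the abstract does not state which variant, radius or descriptors the authors used, so the function name and the toy points are hypothetical.

```python
import math


def sphere_exclusion_split(descriptors, radius):
    """Split point indices into sphere centres and points excluded by a sphere."""
    unassigned = list(range(len(descriptors)))
    selected, excluded = [], []
    while unassigned:
        # Take the next unassigned compound as the centre of a new sphere.
        center = unassigned.pop(0)
        selected.append(center)
        still_unassigned = []
        for idx in unassigned:
            if math.dist(descriptors[center], descriptors[idx]) <= radius:
                excluded.append(idx)  # falls inside the sphere
            else:
                still_unassigned.append(idx)
        unassigned = still_unassigned
    return selected, excluded


# Toy two-dimensional descriptor space (illustrative values only).
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
selected, excluded = sphere_exclusion_split(points, radius=0.5)
print(selected, excluded)  # [0, 2, 4] [1, 3]
```

Which of the two groups becomes the training set and which the test set is a design choice that varies between published variants, so the sketch returns both without labelling them.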


2018 ◽  
Vol 35 (13) ◽  
pp. 2216-2225 ◽  
Author(s):  
Abdurrahman Elbasir ◽  
Balasubramanian Moovarkumudalvan ◽  
Khalid Kunji ◽  
Prasanna R Kolatkar ◽  
Raghvendra Mall ◽  
...  

Abstract Motivation: Protein structure determination has primarily been performed using X-ray crystallography. To overcome the high cost, high attrition rate and series of trial-and-error settings, many in-silico methods have been developed to predict crystallization propensities of proteins based on their sequences. However, the majority of these methods build their predictors by extracting features from protein sequences, which is computationally expensive and can explode the feature space. We propose DeepCrystal, a deep learning framework for sequence-based protein crystallization prediction. It uses deep learning to identify proteins which can produce diffraction-quality crystals without the need to manually engineer additional biochemical and structural features from sequence. Our model is based on convolutional neural networks, which can exploit frequently occurring k-mers and sets of k-mers from the protein sequences to distinguish proteins that will result in diffraction-quality crystals from those that will not. Results: Our model surpasses previous sequence-based protein crystallization predictors in terms of recall, F-score, accuracy and Matthews correlation coefficient (MCC) on three independent test sets. DeepCrystal achieves an average improvement of 1.4% and 12.1% in recall when compared to its closest competitors, Crysalis II and Crysf, respectively. In addition, DeepCrystal attains an average improvement of 2.1% and 6.0% for F-score, 1.9% and 3.9% for accuracy, and 3.8% and 7.0% for MCC relative to Crysalis II and Crysf, respectively, on independent test sets. Availability and implementation: The standalone source code and models are available at https://github.com/elbasir/DeepCrystal and a web-server is also available at https://deeplearning-protein.qcri.org. Supplementary information: Supplementary data are available at Bioinformatics online.
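To see how a convolutional filter can respond to a k-mer, consider a one-hot encoded sequence and a filter whose weights are the one-hot encoding of the k-mer itself. The sketch below is a hand-built illustration in plain Python, not the DeepCrystal architecture; in a trained network the filter weights are learned rather than set by hand, and the example sequence is invented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(sequence):
    """Encode each residue as a 20-dimensional indicator vector."""
    return [[1.0 if aa == residue else 0.0 for aa in AMINO_ACIDS]
            for residue in sequence.upper()]


def kmer_filter(kmer):
    """Build a filter whose weights are the one-hot encoding of the k-mer."""
    return one_hot(kmer)


def convolve(encoded, weights):
    """Slide the filter along the sequence (no padding, stride 1)."""
    k = len(weights)
    scores = []
    for start in range(len(encoded) - k + 1):
        window = encoded[start:start + k]
        # The score counts the positions where the window matches the k-mer.
        score = sum(w * x
                    for w_row, x_row in zip(weights, window)
                    for w, x in zip(w_row, x_row))
        scores.append(score)
    return scores


scores = convolve(one_hot("MHHHKLHHH"), kmer_filter("HHH"))
# A score equal to k (here 3) marks an exact occurrence of the k-mer.
hits = [i for i, s in enumerate(scores) if s == 3.0]
print(hits)  # [1, 6]
```

Partial matches give intermediate scores, which is how a single filter can respond to a family of related k-mers rather than one exact string.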


Author(s):  
Yi-Heng Zhu ◽  
Jun Hu ◽  
Fang Ge ◽  
Fuyi Li ◽  
Jiangning Song ◽  
...  

Abstract X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors.
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.


2015 ◽  
Vol 71 (8) ◽  
pp. 1777-1787 ◽  
Author(s):  
Muriel Gelin ◽  
Vanessa Delfosse ◽  
Frédéric Allemand ◽  
François Hoh ◽  
Yoann Sallaz-Damaz ◽  
...  

X-ray crystallography is an established technique for ligand screening in fragment-based drug-design projects, but the required manual handling steps – soaking crystals with ligand and the subsequent harvesting – are tedious and limit the throughput of the process. Here, an alternative approach is reported: crystallization plates are pre-coated with potential binders prior to protein crystallization and X-ray diffraction is performed directly `in situ' (or in-plate). Its performance is demonstrated on distinct and relevant therapeutic targets currently being studied for ligand screening by X-ray crystallography using either a bending-magnet beamline or a rotating-anode generator. The possibility of using DMSO stock solutions of the ligands to be coated opens up a route to screening most chemical libraries.


CrystEngComm ◽  
2021 ◽  
Author(s):  
Raquel dos Santos ◽  
Maria João Romão ◽  
Ana C A Roque ◽  
Ana Luisa Moreira Carvalho

After more than one hundred and thirty thousand protein structures determined by X-ray crystallography, the challenge of protein crystallization for 3D structure determination remains. In the quest for additives for...


2017 ◽  
Vol 4 (4) ◽  
pp. 557-575 ◽  
Author(s):  
Joshua Holcomb ◽  
Nicholas Spellmon ◽  
Yingxue Zhang ◽  
Maysaa Doughan ◽  
...  

2021 ◽  
Author(s):  
Alexander Derry ◽  
Kristy A. Carpenter ◽  
Russ B. Altman

The three-dimensional structures of proteins are crucial for understanding their molecular mechanisms and interactions. Machine learning algorithms that are able to learn accurate representations of protein structures are therefore poised to play a key role in protein engineering and drug development. The accuracy of such models in deployment is directly influenced by training data quality. The use of different experimental methods for protein structure determination may introduce bias into the training data. In this work, we evaluate the magnitude of this effect across three distinct tasks: estimation of model accuracy, protein sequence design, and catalytic residue prediction. Most protein structures are derived from X-ray crystallography, nuclear magnetic resonance (NMR), or cryo-electron microscopy (cryo-EM); we trained each model on datasets consisting of either all three structure types or of only X-ray data. We find that across these tasks, models consistently perform worse on test sets derived from NMR and cryo-EM than they do on test sets of structures derived from X-ray crystallography, but that the difference can be mitigated when NMR and cryo-EM structures are included in the training set. Importantly, we show that including all three types of structures in the training set does not degrade test performance on X-ray structures, and in some cases even increases it. Finally, we examine the relationship between model performance and the biophysical properties of each method, and recommend that the biochemistry of the task of interest should be considered when composing training sets.

