Study of Real-Valued Distance Prediction For Protein Structure Prediction with Deep Learning

Structure Prediction ◽

3D Structure ◽

Prediction Method ◽

Structure Modeling ◽

Contact Prediction ◽

Real Value ◽

3D Structure Modeling ◽

Abstract: Inter-residue distance prediction by deep ResNet (convolutional residual neural network) has greatly advanced protein structure prediction. Currently the most successful structure prediction methods predict distance by discretizing it into dozens of bins. Here we study how well real-valued distance can be predicted and how useful it is for 3D structure modeling by comparing it with discrete-valued prediction based upon the same deep ResNet. Unlike recent methods that predict only a single real value for the distance of an atom pair, we predict both the mean and standard deviation of a distance and then employ a novel method to fold a protein by the predicted mean and deviation. Our findings include: 1) tested on the CASP13 FM (free-modeling) targets, our real-valued distance prediction obtains 81% precision on top L/5 long-range contact prediction, much better than the best CASP13 results (70%); 2) our real-valued prediction can predict correct folds for the same number of CASP13 FM targets as the best CASP13 group, despite generating only 20 decoys for each target; 3) our method greatly outperforms a very new real-valued prediction method, DeepDist, in both contact prediction and 3D structure modeling; and 4) when the same deep ResNet is used, our real-valued distance prediction has 1-6% higher contact and distance accuracy than our own discrete-valued prediction, but yields less accurate 3D structure models.
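The contrast between discretized and real-valued distance prediction described in this abstract can be illustrated with a minimal sketch. The bin edges, the Gaussian error model, and the function names `bin_index` and `contact_probability` are assumptions made for illustration; the abstract does not specify them.

```python
import math


def bin_index(distance, edges):
    """Map a distance (in Angstrom) to a discrete bin.

    `edges` is an ascending list of upper bin boundaries (hypothetical values);
    distances at or beyond the last edge fall into one extra open-ended bin.
    """
    for i, upper in enumerate(edges):
        if distance < upper:
            return i
    return len(edges)


def contact_probability(mean, std, threshold=8.0):
    """Probability that the distance is below `threshold`.

    Assumes the predicted mean and standard deviation describe a Gaussian,
    which is an illustrative assumption rather than the paper's stated model.
    """
    if std <= 0:
        raise ValueError("std must be positive")
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))
```

Under these assumptions, a real-valued prediction (mean and deviation) can be turned into a contact score, which is one way the two prediction styles could be compared on the same contact metric.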

Bio-Inspired Computing for Information Retrieval Applications - Advances in Knowledge Acquisition, Transfer, and Management ◽

Bioinspired Algorithms in Solving Three-Dimensional Protein Structure Prediction Problems

10.4018/978-1-5225-2375-8.ch012 ◽

2017 ◽

pp. 316-337

Author(s):

Raghunath Satpathy

Keyword(s):

Protein Structure ◽

Structure Prediction ◽

Tertiary Structure ◽

3D Structure ◽

Prediction Method ◽

Optimization Methods ◽

Point Of View ◽

Living Organisms ◽

Prediction Problems

Proteins play a vital molecular role in all living organisms. Determining protein structure experimentally is difficult, so theoretical prediction methods offer a useful alternative. The 3D structure prediction of proteins is very important in biology because it can lead to the discovery of useful drugs and enzymes, and it is currently considered an important research domain. Protein structure prediction concerns the identification of a protein's tertiary structure. From the computational point of view, different models (protein representations) have been developed, along with efficient optimization methods, to predict protein structure. Bio-inspired computation is used mostly for the optimization step in solving protein structure. These algorithms have recently received great interest and attention in the literature. This chapter discusses the key features of five recently developed types of bio-inspired computational algorithms applied to protein structure prediction problems.

Improving deep learning-based protein distance prediction in CASP14

10.1101/2021.02.02.429462 ◽

2021 ◽

Author(s):

Zhiye Guo ◽

Tianqi Wu ◽

Jian Liu ◽

Jie Hou ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Protein Structure ◽

Structure Prediction ◽

Prediction Method ◽

Learning Method ◽

Sequence Alignments ◽

Evolutionary Features ◽

Protein Distance ◽

Abstract: Accurate prediction of residue-residue distances is important for protein structure prediction. We developed several protein distance predictors based on a deep learning distance prediction method and blindly tested them in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The prediction method uses deep residual neural networks with the channel-wise attention mechanism to classify the distance between every two residues into multiple distance intervals. The input features for the deep learning method include co-evolutionary features as well as other sequence-based features derived from multiple sequence alignments (MSAs). Three alignment methods are used with multiple protein sequence/profile databases to generate MSAs for input feature generation. Based on different configurations and training strategies of the deep learning method, five MULTICOM distance predictors were created to participate in the CASP14 experiment. Benchmarked on 37 hard CASP14 domains, the best performing MULTICOM predictor is ranked 5th out of 30 automated CASP14 distance prediction servers in terms of precision of top L/5 long-range contact predictions (i.e. classifying distances between two residues into two categories: in contact (< 8 Angstrom) and not in contact otherwise) and performs better than the best CASP13 distance prediction method. The best performing MULTICOM predictor is also ranked 6th among automated server predictors in classifying inter-residue distances into 10 distance intervals defined by CASP14 according to the F1 measure. The results show that the quality and depth of MSAs depend on alignment methods and sequence databases and have a significant impact on the accuracy of distance prediction. Using larger training datasets and multiple complementary features improves prediction accuracy. However, the number of effective sequences in MSAs is only a weak indicator of the quality of MSAs and the accuracy of predicted distance maps. In contrast, there is a strong correlation between the accuracy of contact/distance predictions and the average probability of the predicted contacts, which can therefore be more effectively used to estimate the confidence of distance predictions and select predicted distance maps.
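The "precision of top L/5 long-range contact predictions" used as a benchmark above can be sketched as follows. This is a minimal illustration, not the CASP evaluation code: the function name and the matrix layout are hypothetical, the 8 Angstrom contact cutoff comes from the abstract, and the sequence separation of at least 24 residues is the commonly used definition of "long-range" and is assumed here.

```python
def top_long_range_precision(prob, true_dist, fraction=5, min_sep=24, cutoff=8.0):
    """Precision of the top L/fraction long-range predicted contacts.

    prob      -- L x L nested list of predicted contact probabilities
    true_dist -- L x L nested list of true inter-residue distances (Angstrom)
    """
    length = len(prob)
    # Collect every residue pair separated by at least `min_sep` positions.
    pairs = [
        (prob[i][j], i, j)
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    # Rank pairs by predicted probability, most confident first.
    pairs.sort(key=lambda item: item[0], reverse=True)
    top = pairs[: max(1, length // fraction)]
    if not top:
        return 0.0
    # A prediction counts as correct when the true distance is under the cutoff.
    hits = sum(1 for _, i, j in top if true_dist[i][j] < cutoff)
    return hits / len(top)
```

The same function gives top L/2 or top L precision by setting `fraction` to 2 or 1.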

DISTEVAL: a web server for evaluating predicted protein distances

BMC Bioinformatics ◽

10.1186/s12859-020-03938-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Badri Adhikari ◽

Bikash Shrestha ◽

Matthew Bernardini ◽

Jie Hou ◽

Jamie Lea

Keyword(s):

Protein Structure ◽

Structure Prediction ◽

Mean Squared Error ◽

3D Structure ◽

Web Server ◽

Absolute Error ◽

3D Models ◽

Qualitative Assessment ◽

Abstract: Background. Protein inter-residue contact and distance prediction are two key intermediate steps essential to accurate protein structure prediction. Distance prediction comes in two forms: real-valued distances and ‘binned’ distograms, which are a more finely grained variant of the binary contact prediction problem. The latter has been introduced as a new challenge in the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14) 2020 experiment. Despite the recent proliferation of methods for predicting distances, few methods exist for evaluating these predictions. Currently only numerical metrics, which evaluate the entire prediction at once, are used. These give no insight into the structural details of a prediction. For this reason, new methods and tools are needed. Results. We have developed a web server for evaluating predicted inter-residue distances. Our server, DISTEVAL, accepts predicted contacts, distances, and a true structure as optional inputs to generate informative heatmaps, chord diagrams, and 3D models. All of these outputs facilitate visual and qualitative assessment. The server also evaluates predictions using other metrics such as mean absolute error, root mean squared error, and contact precision. Conclusions. The visualizations generated by DISTEVAL complement each other and collectively serve as a powerful tool for both quantitative and qualitative assessments of predicted contacts and distances, even in the absence of a true 3D structure.
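The numerical metrics named in this abstract, mean absolute error and root mean squared error, can be computed over a pair of distance maps as in the sketch below. This is not DISTEVAL's implementation; the function name `distance_errors`, the nested-list layout, and the `min_sep` parameter are assumptions chosen for illustration.

```python
import math


def distance_errors(pred, true, min_sep=1):
    """Return (MAE, RMSE) over the upper triangle of two L x L distance maps.

    Only pairs (i, j) with j - i >= min_sep are compared, so the diagonal
    and the symmetric lower triangle are skipped.
    """
    length = len(pred)
    diffs = [
        pred[i][j] - true[i][j]
        for i in range(length)
        for j in range(i + min_sep, length)
    ]
    if not diffs:
        raise ValueError("no residue pairs to compare")
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

Because RMSE squares each difference, it penalizes a few large errors more heavily than MAE, which is why reporting both can be informative.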

Analysis of distance-based protein structure prediction by deep learning in CASP13

10.1101/624460 ◽

2019 ◽

Cited By ~ 2

Author(s):

Jinbo Xu ◽

Sheng Wang

Keyword(s):

Deep Learning ◽

3D Modeling ◽

Structure Prediction ◽

3D Structure ◽

3D Models ◽

Evolutionary Information ◽

Structure Modeling ◽

Multiple Sequence ◽

Contact Prediction ◽

3D Structure Modeling

Abstract: This paper reports the CASP13 results of distance-based contact prediction, threading and folding methods implemented in three RaptorX servers, which are built upon the powerful deep convolutional residual neural network (ResNet) method initiated by us for contact prediction in CASP12. On the 32 CASP13 FM (free-modeling) targets with a median MSA (multiple sequence alignment) depth of 36, RaptorX yielded the best contact prediction among 46 groups and almost the best 3D structure modeling among all server groups without time-consuming conformation sampling. In particular, RaptorX achieved top L/5, L/2 and L long-range contact precision of 70%, 58% and 45%, respectively, and predicted correct folds (TMscore>0.5) for 18 of 32 targets. Although on average underperforming AlphaFold in 3D modeling, RaptorX predicted correct folds for all FM targets with >300 residues (T0950-D1, T0969-D1 and T1000-D2) and generated the best 3D models for T0950-D1 and T0969-D1 among all groups. This CASP13 test confirms our previous findings: (1) predicted distance is more useful than contacts for both template-based and free modeling; and (2) structure modeling may be improved by integrating alignment and co-evolutionary information via deep learning. This paper will discuss the progress we have made since CASP12, the strengths and weaknesses of our methods, and why deep learning performed much better in CASP13.

Performance Analysis of Deep Learning Methods for Protein Contact Prediction in CASP13

CLEI electronic journal ◽

10.19153/cleiej.24.2.3 ◽

2021 ◽

Vol 24 (2) ◽

Author(s):

Romina Valdez ◽

Khevin Roig ◽

Diego P. Pinto-Roa ◽

Jose Colbes

Keyword(s):

Deep Learning ◽

Protein Structure ◽

Structure Prediction ◽

3D Structure ◽

Dimensional Structure ◽

Structural Classification ◽

Data Set ◽

Contact Prediction

Protein structure prediction is one of the most important problems in computational biology and consists of determining the 3D structure of a protein given its amino acid sequence. A key component that has allowed considerable improvements in recent decades is the prediction of contacts in a protein, since it provides fundamental information about its three-dimensional structure. In the 13th edition of CASP (Critical Assessment of protein Structure Prediction), notable progress was made on both problems through the use of deep learning algorithms. For the contact prediction category, the best methods in CASP13 achieved an average precision of 70%. In the present work, the performance of these methods is analyzed using a larger data set, with 483 proteins from four families according to the structural classification of the SCOP database (Structural Classification of Proteins). The selected methods were evaluated using the CASP metrics, and their results indicate an average contact prediction precision greater than 90%. SPOT-Contact was the method with the best overall performance, and one of the methods with the best performance for each SCOP class. The set of proteins used for the experiments and the implementations made for the analysis are publicly available.

Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13

10.1101/552422 ◽

2019 ◽

Cited By ~ 7

Author(s):

Jie Hou ◽

Tianqi Wu ◽

Renzhi Cao ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Protein Structure ◽

Structure Prediction ◽

Tertiary Structure ◽

Structure Modeling ◽

Contact Distance ◽

Protein Model ◽

Template Free ◽

Abstract: Prediction of residue-residue distance relationships (e.g. contacts) has become the key direction to advance protein tertiary structure prediction since the 2014 CASP11 experiment, while deep learning has revolutionized the technology for contact and distance distribution prediction since its debut in the 2012 CASP10 experiment. During the 2018 CASP13 experiment, we enhanced our MULTICOM protein structure prediction system with three major components: contact distance prediction based on deep convolutional neural networks, contact distance-driven template-free (ab initio) modeling, and protein model ranking empowered by deep learning and contact prediction, in addition to an update of other components such as template library, sequence database, and alignment tools. Our experiment demonstrates that contact distance prediction and deep learning methods are the key reasons that MULTICOM was ranked 3rd out of all 98 predictors in both template-free and template-based protein structure modeling in CASP13. Deep convolutional neural networks can utilize global information in pairwise residue-residue features such as co-evolution scores to substantially improve inter-residue contact distance prediction, which played a decisive role in correctly folding some free modeling and hard template-based modeling targets from scratch. Deep learning also successfully integrated 1D structural features, 2D contact information, and 3D structural quality scores to improve protein model quality assessment, where the contact prediction was demonstrated to consistently enhance ranking of protein models for the first time. The success of the MULTICOM system in the CASP13 experiment clearly shows that protein contact distance prediction and model selection driven by powerful deep learning hold the key to solving the protein structure prediction problem. However, there are still major challenges in accurately predicting protein contact distance when there are few homologous sequences to generate co-evolutionary signals, folding proteins from noisy contact distances, and ranking models of hard targets.

MULTICOM2: an open-source protein structure prediction system powered by deep learning and distance prediction

10.21203/rs.3.rs-339464/v1 ◽

2021 ◽

Author(s):

Tianqi Wu ◽

Jian Liu ◽

Zhiye Guo ◽

Jie Hou ◽

Jianlin Cheng

Keyword(s):

Deep Learning ◽

Protein Structure ◽

Open Source ◽

Structure Prediction ◽

Tertiary Structure ◽

Modeling Method ◽

Structure Modeling ◽

Prediction System ◽

Template Free

Abstract Protein structure prediction is an important problem in bioinformatics and has been studied for decades. However, there are still few open-source comprehensive protein structure prediction packages publicly available in the field. In this paper, we present our latest open-source protein tertiary structure prediction system - MULTICOM2, an integration of template-based modeling (TBM) and template-free modeling (FM) methods. The template-based modeling uses sequence alignment tools with deep multiple sequence alignments to search for structural templates, which are much faster and more accurate than MULTICOM1. The template-free (ab initio or de novo) modeling uses the inter-residue distances predicted by DeepDist to reconstruct tertiary structure models without using any known structure as template. In the blind CASP14 experiment, the average TM-score of the models predicted by our server predictor based on the MULTICOM2 system is 0.720 for 58 TBM (regular) domains and 0.514 for 38 FM and FM/TBM (hard) domains, indicating that MULTICOM2 is capable of predicting good tertiary structures across the board. It can predict the correct fold for 76 CASP14 domains (95% regular domains and 55% hard domains) if only one prediction is made for a domain. The success rate is increased to 3% for both regular and hard domains if five predictions are made per domain. Moreover, the prediction accuracy of the pure template-free structure modeling method on both TBM and FM targets is very close to the combination of template-based and template-free modeling methods. This demonstrates that the distance-based template-free modeling method powered by deep learning can largely replace the traditional template-based modeling method even on TBM targets that TBM methods used to dominate and therefore provides a uniform structure modeling approach to any protein. 
Finally, on the 38 CASP14 FM and FM/TBM hard domains, MULTICOM2 server predictors (MULTICOM-HYBRID, MULTICOM-DEEP, MULTICOM-DIST) were ranked among the top 20 automated server predictors in the CASP14 experiment. After combining multiple predictors from the same research group as one entry, MULTICOM-HYBRID was ranked no. 5. The source code of MULTICOM2 is freely available at https://github.com/multicom-toolbox/multicom/tree/multicom_v2.0.

AttentiveDist: Protein Inter-Residue Distance Prediction Using Deep Learning with Attention on Quadruple Multiple Sequence Alignments

10.1101/2020.11.24.396770 ◽

2020 ◽

Author(s):

Aashish Jain ◽

Genki Terashi ◽

Yuki Kagaya ◽

Sai Raghavendra Maddhuri Venkata Subramaniya ◽

Charles Christoffer ◽

...

Keyword(s):

Deep Learning ◽

Structure Prediction ◽

Prediction Models ◽

3D Structure ◽

Evolutionary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Contact Prediction ◽

Multiple Sequence Alignments ◽

Abstract: Protein 3D structure prediction has advanced significantly in recent years due to improving contact prediction accuracy. This improvement has been largely due to deep learning approaches that predict inter-residue contacts and, more recently, distances using multiple sequence alignments (MSAs). In this work we present AttentiveDist, a novel approach that uses different MSAs generated with different E-values in a single model to increase the co-evolutionary information provided to the model. To determine the importance of each MSA’s feature at the inter-residue level, we added an attention layer to the deep neural network. The model is trained in a multi-task fashion to also predict backbone and orientation angles, further improving the inter-residue distance prediction. We show that AttentiveDist outperforms the top methods for contact prediction in the CASP13 structure prediction competition. To aid in structure modeling we also developed two new deep learning-based sidechain center distance and peptide-bond nitrogen-oxygen distance prediction models. Together these led to a 12% increase in TM-score from the best server method in CASP13 for structure prediction.

Phylogenetic correlations have limited effect on coevolution-based contact prediction in proteins

10.1101/2020.08.12.247577 ◽

2020 ◽

Cited By ~ 1

Author(s):

Edwin Rodriguez Horta ◽

Martin Weigt

Keyword(s):

Protein Structure ◽

Amino Acid ◽

Structure Prediction ◽

Protein Sequences ◽

Protein Families ◽

Coupling Analysis ◽

Contact Prediction ◽

Phylogenetic Relations ◽

Direct Coupling Analysis

Abstract: Coevolution-based contact prediction, either directly by coevolutionary couplings resulting from global statistical sequence models or using structural supervision and deep learning, has found widespread application in protein-structure prediction from sequence. However, one of the basic assumptions in global statistical modeling is that sequences form an at least approximately independent sample of an unknown probability distribution, which is to be learned from data. In the case of protein families, this assumption is obviously violated by phylogenetic relations between protein sequences. It has turned out to be notoriously difficult to take phylogenetic correlations into account in coevolutionary model learning. Here, we propose a complementary approach: we develop two strategies to randomize or resample sequence data, such that conservation patterns and phylogenetic relations are preserved, while intrinsic (i.e. structure- or function-based) coevolutionary couplings are removed. An analysis of these data shows that the strongest coevolutionary couplings, i.e. those used by Direct Coupling Analysis to predict contacts, are only weakly influenced by phylogeny. However, phylogeny-induced spurious couplings are of similar size to the bulk of coevolutionary couplings, and dissecting functional from phylogeny-induced couplings might lead to more accurate contact predictions in the range of intermediate-size couplings. The code is available at https://github.com/ed-rodh/Null_models_I_and_II.

Author summary: Many homologous protein families contain thousands of highly diverged amino-acid sequences, which fold in close-to-identical three-dimensional structures and fulfill almost identical biological tasks. Global coevolutionary models, like those inferred by the Direct Coupling Analysis (DCA), assume that families can be considered as samples of some unknown statistical model, and that the parameters of these models represent evolutionary constraints acting on protein sequences. To learn these models from data, DCA and related approaches have to also assume that the distinct sequences in a protein family are close to independent, while in reality they are characterized by involved hierarchical phylogenetic relationships. Here we propose null models for sequence alignments, which maintain patterns of amino-acid conservation and phylogeny contained in the data, but destroy any coevolutionary couplings, frequently used in protein structure prediction. We find that phylogeny actually induces spurious non-zero couplings. These are, however, significantly smaller than the largest couplings derived from natural sequences, and therefore have only little influence on the first predicted contacts. However, in the range of intermediate couplings, they may lead to statistically significant effects. Dissecting phylogenetic from functional couplings might therefore extend the range of accurately predicted structural contacts down to smaller coupling strengths than those currently used.