CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction

Protein functions are largely determined by the final details of their tertiary structures, and these structures can be accurately reconstructed from inter-residue distances. Residue co-evolution has become the primary principle for estimating inter-residue distances, since residues in close spatial proximity tend to co-evolve. Widely used approaches infer residue co-evolution indirectly: they first extract handcrafted features, such as the covariance matrix, from the multiple sequence alignment (MSA) of the query protein, and then infer residue co-evolution from these features rather than from the raw information carried by the MSA. This indirect strategy always leads to considerable information loss and inaccurate estimation of inter-residue distances. Here, we report a deep neural network framework, CopulaNet, that learns residue co-evolution directly from the MSA without any handcrafted features. CopulaNet consists of two key elements: i) an encoder that models context-specific mutation for each residue, and ii) an aggregator that models correlations among residues and thereby infers residue co-evolution. Using the target proteins of CASP13 (the 13th Critical Assessment of Protein Structure Prediction) as representatives, we demonstrate the successful application of CopulaNet to estimating inter-residue distances and to predicting protein tertiary structure with improved accuracy and efficiency. A head-to-head comparison suggested that ProFOLD outperformed AlphaFold, one of the state-of-the-art prediction approaches, on 24 of the 31 free-modeling CASP13 domains.
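
The handcrafted covariance feature that the abstract contrasts with CopulaNet can be illustrated with a small sketch. This is not the authors' code: the toy alignment, the function names, and the use of a Frobenius norm as a per-pair score are all assumptions made for illustration. For columns i and j, the covariance of residues a and b is taken as f_ij(a, b) - f_i(a) f_j(b), where f denotes observed frequencies in the MSA.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap symbol


def column_frequencies(msa, i):
    """Frequency of each alphabet symbol in alignment column i."""
    n = len(msa)
    return {a: sum(seq[i] == a for seq in msa) / n for a in ALPHABET}


def pair_frequencies(msa, i, j):
    """Joint frequency of each symbol pair observed in columns i and j."""
    n = len(msa)
    counts = {}
    for seq in msa:
        key = (seq[i], seq[j])
        counts[key] = counts.get(key, 0) + 1
    return {key: c / n for key, c in counts.items()}


def covariance(msa, i, j):
    """Covariance block: f_ij(a, b) - f_i(a) * f_j(b) for every pair (a, b)."""
    fi = column_frequencies(msa, i)
    fj = column_frequencies(msa, j)
    fij = pair_frequencies(msa, i, j)
    return {
        (a, b): fij.get((a, b), 0.0) - fi[a] * fj[b]
        for a, b in product(ALPHABET, repeat=2)
    }


def coupling_score(msa, i, j):
    """Frobenius norm of the covariance block (an illustrative summary score)."""
    return sum(v * v for v in covariance(msa, i, j).values()) ** 0.5


# Hypothetical toy alignment: columns 0 and 3 co-vary, columns 0 and 2 do not.
toy_msa = ["ACDA", "ACEA", "GCDG", "GCEG"]
covarying = coupling_score(toy_msa, 0, 3)
independent = coupling_score(toy_msa, 0, 2)
```

In this toy alignment the score separates the co-varying column pair from the independent one, but the whole alignment has been reduced to pairwise frequency tables, which is the kind of summarisation the abstract argues loses information.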

Liquid-theory analogy of direct-coupling analysis of multiple-sequence alignment and its implications for protein structure prediction

Biophysics and Physicobiology ◽

10.2142/biophysico.12.0_117 ◽

2015 ◽

Vol 12 (0) ◽

pp. 117-119 ◽

Cited By ~ 1

Author(s):

Akira R. Kinjo

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Direct Coupling ◽

Coupling Analysis ◽

Multiple Sequence ◽

Liquid Theory ◽

Direct Coupling Analysis

fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

10.1101/2021.12.20.473431 ◽

2021 ◽

Author(s):

Liang Hong ◽

Siqi Sun ◽

Liangzhen Zheng ◽

Qingxiong Tan ◽

Yu Li

Keyword(s):

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Structure And Function ◽

Sequence Alignments ◽

Protein Structure And Function ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

And Function

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which comprises searching large databases for homologs and aligning the resulting sequences, extracts this information effectively and assists protein structure and function prediction; its value has been demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs across the entire UniProt is still time-consuming. Given the success of AlphaFold, large-scale multiple sequence alignment against massive databases will foreseeably become a trend in the field, so accelerating this step is highly desirable. Here, we propose a novel method, fastMSA, that improves the speed significantly. Our idea is orthogonal to all previous acceleration methods. Taking advantage of a BERT-based protein language model, we propose a novel dual-encoder architecture that embeds protein sequences into a low-dimensional space and efficiently filters out unrelated sequences before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks such as structure prediction with AlphaFold: using multiple sequence alignments generated by our method, protein structure prediction loses little accuracy while requiring much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
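
The filter-then-align idea can be sketched as follows. This is only an illustration, not fastMSA itself: the learned BERT-based dual encoder is replaced by a stand-in embedding (normalised amino-acid composition), and the names `embed` and `prefilter`, the toy database, and the `top_k` parameter are all hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Stand-in encoder (assumption): L2-normalised amino-acid composition.

    A real system would use a learned sequence encoder here.
    """
    counts = [sequence.count(a) for a in AMINO_ACIDS]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def prefilter(query, database, top_k):
    """Keep the top_k database entries by cosine similarity to the query."""
    q = embed(query)
    scored = []
    for name, seq in database.items():
        v = embed(seq)
        similarity = sum(x * y for x, y in zip(q, v))
        scored.append((similarity, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Hypothetical toy database; only the survivors would go on to alignment.
database = {
    "homolog_1": "MKVLAAGIVK",
    "homolog_2": "MKVLSAGIVR",
    "unrelated": "WWPPHHCCDD",
}
candidates = prefilter("MKVLAAGIVR", database, top_k=2)
```

The speed-up in such a scheme comes from running the expensive alignment step only on `candidates` rather than on the whole database; recall then depends on how well the embedding separates homologs from unrelated sequences.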

Novel use of a genetic algorithm for protein structure prediction: Searching template and sequence alignment space

Proteins Structure Function and Bioinformatics ◽

10.1002/prot.10549 ◽

2003 ◽

Vol 53 (S6) ◽

pp. 424-429 ◽

Cited By ~ 21

Author(s):

Bruno Contreras-Moreira ◽

Paul W. Fitzjohn ◽

Marc Offman ◽

Graham R. Smith ◽

Paul A. Bates

Keyword(s):

Genetic Algorithm ◽

Protein Structure ◽

Protein Structure Prediction ◽

Sequence Alignment ◽

Structure Prediction

Comprehensive Study on Enhancing Low-Quality Position-Specific Scoring Matrix with Deep Learning for Accurate Protein Structure Property Prediction: Using Bagging Multiple Sequence Alignment Learning

Journal of Computational Biology ◽

10.1089/cmb.2020.0416 ◽

2021 ◽

Vol 28 (4) ◽

pp. 346-361

Author(s):

Yuzhi Guo ◽

Jiaxiang Wu ◽

Hehuan Ma ◽

Sheng Wang ◽

Junzhou Huang

Keyword(s):

Deep Learning ◽

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Position Specific Scoring Matrix ◽

Structure Property ◽

Multiple Sequence ◽

Property Prediction ◽

Scoring Matrix ◽

Comprehensive Study

Algorithms for Multiple Protein Structure Alignment and Structure-Derived Multiple Sequence Alignment

Protein Structure Prediction, Second Edition ◽

10.1385/1-59745-574-1:125 ◽

2007 ◽

pp. 125-146

Author(s):

Maxim Shatsky ◽

Ruth Nussinov ◽

Haim J. Wolfson

Keyword(s):

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Alignment ◽

Protein Structure Alignment ◽

Multiple Sequence ◽

Multiple Protein

Prediction of Structural and Functional Aspects of Protein

Advances in Secure Computing, Internet Services, and Applications - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-4666-4940-8.ch016 ◽

2014 ◽

pp. 317-333

Author(s):

Arun G. Ingale

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Tertiary Structure ◽

Protein Structures ◽

Three Dimensional ◽

Dimensional Structure ◽

Sequence Information ◽

Predict Protein Structure ◽

Basic Ideas

Predicting the structure of a protein from its primary amino acid sequence is computationally difficult. An investigation of the methods and algorithms used to predict protein structure, together with a thorough knowledge of the function and structure of proteins, is critical for the advancement of biology and the life sciences as well as for the development of better drugs, higher-yield crops, and even synthetic bio-fuels. To that end, this chapter sheds light on the methods used for protein structure prediction. It covers the applications of modeled protein structures and examines the relationship between pure sequence information and three-dimensional structure, which continues to be one of the greatest challenges in molecular biology. It presents a broad examination of the problems, methods, tools, servers, databases, and applications of protein structure prediction, giving insight into the future applications of modeled protein structures. Current protein structure prediction methods are reviewed as background on structure prediction, the prediction of structural fundamentals, tertiary structure prediction, and functional implications. The basic ideas and advances in these directions are discussed in detail.

Bioinspired Algorithms in Solving Three-Dimensional Protein Structure Prediction Problems

Bio-Inspired Computing for Information Retrieval Applications - Advances in Knowledge Acquisition, Transfer, and Management ◽

10.4018/978-1-5225-2375-8.ch012 ◽

2017 ◽

pp. 316-337

Author(s):

Raghunath Satpathy

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Structure Prediction ◽

Tertiary Structure ◽

3D Structure ◽

Prediction Method ◽

Optimization Methods ◽

Point Of View ◽

Living Organisms ◽

Prediction Problems

Proteins play a vital molecular role in all living organisms. Determining protein structure experimentally is difficult; theoretical prediction methods offer an alternative. Predicting the 3D structure of proteins is very important in biology, as it leads to the discovery of useful drugs and enzymes, and it is currently considered an important research domain. Protein structure prediction concerns the identification of a protein's tertiary structure. From the computational point of view, different models (protein representations) have been developed, along with efficient optimization methods, to predict protein structure. Bio-inspired computation is used mostly for the optimization process in solving protein structure. These algorithms have recently received great interest and attention in the literature. This chapter discusses the key features of five recently developed types of bio-inspired computational algorithms applied to protein structure prediction problems.

Deep Template-based Protein Structure Prediction

10.1101/2020.12.26.424433 ◽

2020 ◽

Author(s):

Fandi Wu ◽

Jinbo Xu

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Random Fields ◽

Structure Prediction ◽

Conditional Random Fields ◽

3D Models ◽

Query Protein ◽

Supplementary Information ◽

Distance Information ◽

Alternating Direction

Abstract. Motivation: TBM (template-based modeling) is a popular method for protein structure prediction. When very good templates are not available, it is challenging to identify the best templates, build accurate sequence-template alignments, and construct 3D models from alignments. Results: This paper presents a new method, NDThreader (New Deep-learning Threader), to address the challenges of TBM. NDThreader first employs DRNF (deep convolutional residual neural fields), which is an integration of deep ResNet (convolutional residual neural networks) and CRF (conditional random fields), to align a query protein to templates without using any distance information. Then NDThreader uses ADMM (alternating direction method of multipliers) and DRNF to further improve sequence-template alignments by making use of predicted distance potential. Finally, NDThreader builds 3D models from a sequence-template alignment by feeding it and sequence co-evolution information into a deep ResNet to predict inter-atom distance distribution, which is then fed into PyRosetta for 3D model construction. Our experimental results on the CASP13 and CAMEO data show that our methods outperform existing ones such as CNFpred, HHpred, DeepThreader and CEthreader. NDThreader was blindly tested in CASP14 as a part of the RaptorX server, which obtained the best GDT score among all CASP14 servers on the 58 TBM targets. Availability and Implementation: available as a part of web server at http://[email protected] Supplementary Information: Supplementary data are available online.

Improved computational methods of protein sequence alignment, model selection and tertiary structure prediction

10.32469/10355/46126 ◽

2013 ◽

Author(s):

Xin Deng

Keyword(s):

Protein Structure ◽

Secondary Structure ◽

Model Selection ◽

Sequence Alignment ◽

Protein Sequence ◽

Structure Prediction ◽

Tertiary Structure ◽

Solvent Accessibility ◽

Relative Solvent Accessibility ◽

Tertiary Structure Prediction

Protein sequence and profile alignment is essential to most bioinformatics tasks, such as protein structure modeling, function prediction, and phylogenetic analysis. We designed a new algorithm, MSACompro, to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into multiple protein sequence alignment. Our experiments showed that it improved multiple sequence alignment accuracy over most existing methods that do not use structural information, and that it performed comparably, with slightly lower scores, to the method using structural features and additional homologous sequences. We also developed HHpacom, a new profile-profile pairwise alignment method that integrates secondary structure, solvent accessibility, torsion angle, and inferred residue pair coupling information. The evaluation showed that the secondary structure, relative solvent accessibility, and torsion angle information significantly improved alignment accuracy in comparison with the state-of-the-art methods HHsearch and HHsuite. The evolutionary constraint information did help in some cases, especially for alignments of short proteins, typically 100 to 500 residues. Protein model selection is also a key step in protein tertiary structure prediction. We developed two SVM model quality assessment methods that take the query-template alignment as input. The assessment results illustrated that this could help improve model selection, protein structure prediction, and many other bioinformatics problems. Moreover, we also developed a protein tertiary structure prediction pipeline, many components of which were built into our group's MULTICOM system. MULTICOM performed well in the CASP10 (Critical Assessment of Techniques for Protein Structure Prediction) competition.

Protein Complex Structure Prediction Powered by Multiple Sequence Alignment of Interologs from Multiple Taxonomic Ranks and AlphaFold2

10.1101/2021.12.21.473437 ◽

2021 ◽

Author(s):

Yunda Si ◽

Chengfei Yan

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Complex ◽

Structure Prediction ◽

Complex Structure ◽

Complex Structures ◽

Success Rates ◽

Multiple Sequence ◽

Taxonomic Rank ◽

Protein Protein Interaction

AlphaFold2 is expected to be able to predict protein complex structures as long as a multiple sequence alignment (MSA) of the interologs of the target protein-protein interaction (PPI) can be provided. However, preparing the MSA of protein-protein interologs is a non-trivial task. In this study, a simplified phylogeny-based approach was applied to generate the MSA of interologs, which was then used as the input of AlphaFold2 for protein complex structure prediction. Having extensively benchmarked this protocol on a non-redundant PPI dataset, we show that complex structures of 79.5% of the bacterial PPIs and 49.8% of the eukaryotic PPIs can be successfully predicted. Considering that PPIs may not be conserved in species separated by long evolutionary distances, we further restricted the interologs in the MSA to different taxonomic ranks of the species of the target PPI. We found that the success rates can be increased to 87.9% for the bacterial PPIs and 56.3% for the eukaryotic PPIs if the interologs in the MSA are restricted to a specific taxonomic rank of the species of each target PPI. Finally, we show that the optimal taxonomic ranks for protein complex structure prediction can be selected using the predicted TM-scores of the output models.
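
The final selection step can be pictured with a minimal sketch, assuming one output model per taxonomic rank and one predicted TM-score per model. The function name `select_rank` and the scores below are made up for illustration; they are not results from the study.

```python
def select_rank(predicted_tm_by_rank):
    """Return the taxonomic rank whose model has the highest predicted TM-score."""
    return max(predicted_tm_by_rank, key=predicted_tm_by_rank.get)


# Hypothetical predicted TM-scores, one per taxonomic rank used to restrict the MSA.
example_scores = {"kingdom": 0.41, "phylum": 0.58, "class": 0.73, "order": 0.66}
best_rank = select_rank(example_scores)
```

Because the predicted TM-score is available without knowing the true structure, a rule of this kind can be applied per target at prediction time.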
