Protein Complex Structure Prediction Powered by Multiple Sequence Alignment of Interologs from Multiple Taxonomic Ranks and AlphaFold2


10.1101/2021.12.21.473437 ◽

2021 ◽

Author(s):

Yunda Si ◽

Chengfei Yan

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Complex ◽

Structure Prediction ◽

Complex Structure ◽

Complex Structures ◽

Success Rates ◽

Multiple Sequence ◽

Taxonomic Rank ◽

Protein Protein Interaction

AlphaFold2 is expected to be able to predict protein complex structures as long as a multiple sequence alignment (MSA) of the interologs of the target protein-protein interaction (PPI) can be provided. However, preparing the MSA of protein-protein interologs is a non-trivial task. In this study, a simplified phylogeny-based approach was applied to generate the MSA of interologs, which was then used as the input to AlphaFold2 for protein complex structure prediction. Extensively benchmarking this protocol on a non-redundant PPI dataset, we show that complex structures of 79.5% of the bacterial PPIs and 49.8% of the eukaryotic PPIs can be successfully predicted. Considering that PPIs may not be conserved between species separated by long evolutionary distances, we further restricted the interologs in the MSA to different taxonomic ranks of the species of the target PPI. We found that the success rates can be increased to 87.9% for the bacterial PPIs and 56.3% for the eukaryotic PPIs if interologs in the MSA are restricted to a specific taxonomic rank of the species of each target PPI. Finally, we show that the optimal taxonomic rank for protein complex structure prediction can be selected using the predicted TM-scores of the output models.
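The two ideas in this abstract, restricting interologs to a taxonomic rank and choosing the rank by predicted TM-score, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the function names `restrict_to_rank` and `select_best_rank`, and all example values are assumptions made for illustration, and the AlphaFold2 run that would produce the predicted TM-scores is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical record type: (paired interolog sequence, lineage of its species),
# where the lineage maps a rank name such as "phylum" to a taxon name.
Interolog = Tuple[str, Dict[str, str]]


def restrict_to_rank(interologs: List[Interolog],
                     target_lineage: Dict[str, str],
                     rank: str) -> List[Interolog]:
    """Keep interologs whose species shares the target's taxon at `rank`."""
    target_taxon = target_lineage[rank]
    return [rec for rec in interologs if rec[1].get(rank) == target_taxon]


def select_best_rank(predicted_tm: Dict[str, float]) -> str:
    """Return the rank whose output model has the highest predicted TM-score."""
    if not predicted_tm:
        raise ValueError("no models to choose from")
    return max(predicted_tm, key=predicted_tm.get)


# Made-up example data (not taken from the paper).
target = {"phylum": "Proteobacteria", "class": "Gammaproteobacteria"}
msa = [
    ("MKTAYIAK", {"phylum": "Proteobacteria", "class": "Gammaproteobacteria"}),
    ("MKSAYLAK", {"phylum": "Proteobacteria", "class": "Alphaproteobacteria"}),
    ("MRTGYIGR", {"phylum": "Firmicutes", "class": "Bacilli"}),
]
phylum_msa = restrict_to_rank(msa, target, "phylum")  # two sequences remain
class_msa = restrict_to_rank(msa, target, "class")    # one sequence remains

# Made-up predicted TM-scores of the models built from each restricted MSA.
scores = {"phylum": 0.72, "class": 0.84}
best_rank = select_best_rank(scores)
```

In this toy case `best_rank` is `"class"`, because the model built from the class-level MSA has the higher predicted TM-score.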


CopulaNet: Learning residue co-evolution directly from multiple sequence alignment for protein structure prediction

10.1101/2020.10.06.327585 ◽

2020 ◽

Author(s):

Fusong Ju ◽

Jianwei Zhu ◽

Bin Shao ◽

Lupeng Kong ◽

Tie-Yan Liu ◽

...

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Tertiary Structure ◽

Query Protein ◽

Spatial Proximity ◽

Multiple Sequence ◽

Variance Matrix

Protein functions are largely determined by the final details of their tertiary structures, and the structures can be accurately reconstructed from inter-residue distances. Residue co-evolution has become the primary principle for estimating inter-residue distances, since residues in close spatial proximity tend to co-evolve. The widely used approaches infer residue co-evolution using an indirect strategy, i.e., they first extract some handcrafted features, such as a covariance matrix, from the multiple sequence alignment (MSA) of the query protein, and then infer residue co-evolution from these features rather than from the raw information carried by the MSA. This indirect strategy always leads to considerable information loss and inaccurate estimation of inter-residue distances. Here, we report a deep neural network framework (called CopulaNet) that learns residue co-evolution directly from the MSA without any handcrafted features. CopulaNet consists of two key elements: i) an encoder to model context-specific mutation for each residue, and ii) an aggregator to model correlations among residues and thereafter infer residue co-evolution. Using the CASP13 (the 13th Critical Assessment of Protein Structure Prediction) target proteins as representatives, we demonstrated the successful application of CopulaNet for estimating inter-residue distances and further predicting protein tertiary structure with improved accuracy and efficiency. Head-to-head comparison suggested that for 24 out of the 31 free modeling CASP13 domains, ProFOLD outperformed AlphaFold, one of the state-of-the-art prediction approaches.
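For readers unfamiliar with the "handcrafted feature" that this abstract contrasts with CopulaNet, the following Python sketch computes the covariance between two MSA columns from residue frequencies, cov(a, b) = f_ij(a, b) - f_i(a) f_j(b). It illustrates the conventional indirect strategy only, not CopulaNet itself; the function name `pair_covariance` and the toy alignment are assumptions, and practical pipelines usually add pseudocounts and sequence weighting.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple


def pair_covariance(msa: List[str], i: int, j: int) -> Dict[Tuple[str, str], float]:
    """Covariance of residue pairs (a, b) observed at columns i and j.

    cov(a, b) = f_ij(a, b) - f_i(a) * f_j(b), where f denotes the
    unweighted frequency over the sequences in the alignment.
    """
    n = len(msa)
    f_i = Counter(seq[i] for seq in msa)              # single-column counts
    f_j = Counter(seq[j] for seq in msa)
    f_ij = Counter((seq[i], seq[j]) for seq in msa)   # joint counts
    return {
        (a, b): f_ij[(a, b)] / n - (f_i[a] / n) * (f_j[b] / n)
        for a, b in product(f_i, f_j)
    }


# Toy alignment in which columns 0 and 1 vary together perfectly.
toy_msa = ["AC", "AC", "GT", "GT"]
cov = pair_covariance(toy_msa, 0, 1)
```

Here the pair (A, C) always co-occurs, so its covariance is positive, while the pair (A, T) never co-occurs and its covariance is negative.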


Deep Neural Network for Protein Contact Prediction by Weighting Sequences in a Multiple Sequence Alignment

10.1101/331926 ◽

2018 ◽

Author(s):

Hiroyuki Fukuda ◽

Kentaro Tomii

Keyword(s):

Neural Network ◽

Supervised Learning ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Deep Neural Network ◽

Multiple Sequence ◽

Contact Prediction ◽

Meta Learning ◽

Correlation Information

Protein contact prediction is a crucially important step for protein structure prediction. To predict a contact, approaches of two types are used: evolutionary coupling analysis (ECA) and supervised learning. ECA uses a large multiple sequence alignment (MSA) of homologous sequences and extracts correlation information between residues. Supervised learning uses ECA results as input features and can produce higher accuracy. Here, we present a new approach to contact prediction that can both extract correlation information and predict contacts in a supervised manner directly from an MSA using a deep neural network (DNN). Using the DNN, we can obtain higher accuracy than with earlier ECA methods. Simultaneously, we can weight each sequence in the MSA to eliminate noise sequences automatically in a supervised way. It is expected that the combination of our method and other meta-learning methods can provide much higher accuracy of contact prediction.
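As background for the idea of weighting sequences in an MSA, the sketch below shows the conventional, unsupervised reweighting heuristic often used in coupling analysis: each sequence is down-weighted by the number of sequences that are at least as similar to it as a chosen identity threshold. This is not the learned, supervised weighting described in the abstract; the function names, the 0.8 threshold, and the toy alignment are assumptions.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def sequence_weights(msa: List[str], threshold: float = 0.8) -> List[float]:
    """Weight of each sequence = 1 / (number of sequences within the threshold).

    Every sequence counts itself, so the denominator is never zero.
    """
    return [
        1.0 / sum(identity(s, t) >= threshold for t in msa)
        for s in msa
    ]


# Toy alignment: the first two sequences are identical, so they share weight.
toy_msa = ["ACDE", "ACDE", "ACDF", "WYHK"]
weights = sequence_weights(toy_msa)
effective_count = sum(weights)  # effective number of sequences
```

With these inputs the two duplicate sequences each receive weight 0.5 and the others weight 1.0, giving an effective sequence count of 3.0.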


Integrating Protein Secondary Structure Prediction and Multiple Sequence Alignment

Current Protein and Peptide Science ◽

10.2174/1389203043379675 ◽

2004 ◽

Vol 5 (4) ◽

pp. 249-266 ◽

Cited By ~ 35

Author(s):

V. Simossis ◽

J. Heringa

Keyword(s):

Secondary Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Protein Secondary Structure ◽

Protein Secondary Structure Prediction ◽

Multiple Sequence


A Novel Comparative Sequence Analysis Method for ncRNA Secondary Structure Prediction without Multiple Sequence Alignment

2008 Fourth International Conference on Natural Computation ◽

10.1109/icnc.2008.446 ◽

2008 ◽

Author(s):

Quan Zou ◽

Mao-Zu Guo ◽

Yang Liu ◽

Zhi-An Xing

Keyword(s):

Sequence Analysis ◽

Secondary Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Comparative Sequence Analysis ◽

Analysis Method ◽

Multiple Sequence ◽

Comparative Sequence


Application of multiple sequence alignment profiles to improve protein secondary structure prediction

Proteins Structure Function and Bioinformatics ◽

10.1002/1097-0134(20000815)40:3<502::aid-prot170>3.0.co;2-q ◽

2000 ◽

Vol 40 (3) ◽

pp. 502-511 ◽

Cited By ~ 484

Author(s):

James A. Cuff ◽

Geoffrey J. Barton

Keyword(s):

Secondary Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Protein Secondary Structure ◽

Protein Secondary Structure Prediction ◽

Multiple Sequence


Liquid-theory analogy of direct-coupling analysis of multiple-sequence alignment and its implications for protein structure prediction

Biophysics and Physicobiology ◽

10.2142/biophysico.12.0_117 ◽

2015 ◽

Vol 12 (0) ◽

pp. 117-119 ◽

Cited By ~ 1

Author(s):

Akira R. Kinjo

Keyword(s):

Protein Structure ◽

Protein Structure Prediction ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Direct Coupling ◽

Coupling Analysis ◽

Multiple Sequence ◽

Liquid Theory ◽

Direct Coupling Analysis


Integrating ab initio and template-based algorithms for protein–protein complex structure prediction

Bioinformatics ◽

10.1093/bioinformatics/btz623 ◽

2019 ◽

Vol 36 (3) ◽

pp. 751-757 ◽

Cited By ~ 1

Author(s):

Sweta Vangaveti ◽

Thom Vreven ◽

Yang Zhang ◽

Zhiping Weng

Keyword(s):

Protein Complex ◽

Structure Prediction ◽

Protein Complexes ◽

Complex Structure ◽

Protein Docking ◽

Supplementary Information ◽

Test Case ◽

Binding Modes ◽

Success Rates ◽

Template Free

Motivation: Template-based and template-free methods have both been widely used in predicting the structures of protein–protein complexes. Template-based modeling is effective when a reliable template is available, while template-free methods are required for predicting the binding modes or interfaces that have not been previously observed. Our goal is to combine the two methods to improve computational protein–protein complex structure prediction.

Results: Here, we present a method to identify and combine high-confidence predictions of a template-based method (SPRING) with a template-free method (ZDOCK). Cross-validated using the protein–protein docking benchmark version 5.0, our method (ZING) achieved a success rate of 68.2%, outperforming SPRING and ZDOCK, with success rates of 52.1% and 35.9% respectively, when the top 10 predictions were considered per test case. In conclusion, a statistics-based method that evaluates and integrates predictions from template-based and template-free methods is more successful than either method independently.

Availability and implementation: ZING is available for download as a Github repository (https://github.com/weng-lab/ZING.git).

Supplementary information: Supplementary data are available at Bioinformatics online.
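The success rates quoted above are computed over the top 10 predictions per test case. A minimal Python sketch of such a top-N success rate follows; it assumes each test case is given as a ranked list of booleans marking whether each prediction is acceptable. The function name `top_n_success_rate` and the toy data are illustrative assumptions, and the acceptance criterion itself is not reproduced here.

```python
from typing import Dict, List


def top_n_success_rate(ranked_hits: Dict[str, List[bool]], n: int) -> float:
    """Fraction of test cases with at least one acceptable prediction in the top n.

    `ranked_hits[case]` lists, in rank order, whether each prediction for
    that case is acceptable under whatever criterion the benchmark uses.
    """
    if not ranked_hits:
        raise ValueError("no test cases given")
    successes = sum(any(hits[:n]) for hits in ranked_hits.values())
    return successes / len(ranked_hits)


# Made-up example with three test cases.
toy_cases = {
    "case1": [False, True, False],   # first acceptable model at rank 2
    "case2": [False, False, False],  # no acceptable model
    "case3": [True],                 # acceptable model at rank 1
}
rate_top1 = top_n_success_rate(toy_cases, 1)
rate_top2 = top_n_success_rate(toy_cases, 2)
```

In the toy data one of three cases succeeds at N = 1 and two of three at N = 2, which shows why a reported success rate should always state the N it was computed for.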


fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

10.1101/2021.12.20.473431 ◽

2021 ◽

Author(s):

Liang Hong ◽

Siqi Sun ◽

Liangzhen Zheng ◽

Qingxiong Tan ◽

Yu Li

Keyword(s):

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Structure And Function ◽

Sequence Alignments ◽

Protein Structure And Function ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

And Function

Evolutionarily related sequences provide information about protein structure and function. Multiple sequence alignment, which includes homolog searching from large databases and sequence alignment, is an effective way to extract this information and to assist protein structure and function prediction, as demonstrated by AlphaFold. Despite the existing tools for multiple sequence alignment, searching for homologs across the entire UniProt is still time-consuming. Considering the success of AlphaFold, large-scale multiple sequence alignments against massive databases will foreseeably become a trend in the field, so it is very desirable to accelerate this step. Here, we propose a novel method, fastMSA, to improve the speed significantly. Our idea is orthogonal to all the previous acceleration methods. Taking advantage of a protein language model based on BERT, we propose a novel dual encoder architecture that can embed protein sequences into a low-dimensional space and filter out unrelated sequences efficiently before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with downstream tasks, such as structure prediction using AlphaFold. Using multiple sequence alignments generated by our method, we observe little loss of performance in protein structure prediction with much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.
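The prefiltering step described in this abstract can be pictured with a small Python sketch: database sequences are ranked by the cosine similarity of their embeddings to the query embedding, and only the top k are kept for the slower alignment search. This is a sketch of generic dense retrieval, not the fastMSA implementation. The embeddings are made-up two-dimensional vectors standing in for the output of a dual encoder, the names `cosine` and `prefilter` are assumptions, and the subsequent BLAST run is not shown.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def prefilter(query_vec: List[float],
              db_vecs: Dict[str, List[float]],
              k: int) -> List[str]:
    """Return the names of the k database sequences most similar to the query."""
    ranked = sorted(db_vecs,
                    key=lambda name: cosine(query_vec, db_vecs[name]),
                    reverse=True)
    return ranked[:k]


# Made-up embeddings; a real system would use vectors from a trained encoder.
query = [1.0, 0.0]
database = {
    "seq1": [0.9, 0.1],
    "seq2": [0.0, 1.0],
    "seq3": [1.0, 0.5],
}
candidates = prefilter(query, database, k=2)
```

Only the sequences named in `candidates` would then be passed to the alignment tool, which is where the speed-up in such a scheme comes from. For large databases an approximate nearest-neighbour index would normally replace the full sort used here.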
