DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins

Chengxin Zhang; Wei Zheng; S M Mortuza; Yang Li; Yang Zhang

doi:10.1093/bioinformatics/btz863

Bioinformatics ◽

10.1093/bioinformatics/btz863 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2105-2112 ◽

Cited By ~ 14

Author(s):

Chengxin Zhang ◽

Wei Zheng ◽

S M Mortuza ◽

Yang Li ◽

Yang Zhang

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Secondary Structure Prediction ◽

Supplementary Information ◽

Structure Identification ◽

Whole Genome ◽

Multiple Sequence ◽

Contact Prediction ◽

Homologous Sequences

Abstract

Motivation: The success of genome sequencing techniques has resulted in a rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information for modeling the structure and function of unknown proteins. There is, however, no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved.

Results: We developed DeepMSA, a new open-source method for sensitive MSA construction, in which homologous sequences and alignments are created from multiple sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was used to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which increased the accuracy of long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs were run for homologous structure identification, where the average TM-score of the template alignments increased by over 7.5% with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. All these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library.

Availability and implementation: https://zhanglab.ccmb.med.umich.edu/DeepMSA/.

Supplementary information: Supplementary data are available at Bioinformatics online.
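The long-range contact accuracy mentioned in this abstract is usually reported as the precision of the top-ranked predictions among residue pairs that are far apart in sequence. The sketch below shows one way such a number could be computed. The abstract states neither the separation cutoff nor how many predictions are scored, so the defaults here (`min_sep=24`, top L predictions) are assumptions, and the function name and data layout are hypothetical rather than taken from DeepMSA.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[int, int]  # residue index pair (i, j)


def long_range_precision(
    scores: Dict[Pair, float],
    native: Set[Pair],
    seq_len: int,
    min_sep: int = 24,      # assumed long-range cutoff, not stated in the abstract
    top_frac: float = 1.0,  # score the top seq_len * top_frac predictions (assumed)
) -> float:
    """Fraction of the top-ranked long-range predictions that are true contacts."""
    # Keep only pairs whose sequence separation counts as long-range.
    candidates = [
        (pair, score)
        for pair, score in scores.items()
        if abs(pair[0] - pair[1]) >= min_sep
    ]
    # Rank by predicted contact score, highest first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    n_top = max(1, int(seq_len * top_frac))
    top = [pair for pair, _ in candidates[:n_top]]
    if not top:
        return 0.0
    # A prediction is a hit if the pair is a native contact in either order.
    hits = sum(1 for i, j in top if (i, j) in native or (j, i) in native)
    return hits / len(top)
```

Running the same predictor on a default MSA and on a deeper MSA, then comparing the two returned values, would give the kind of relative improvement the abstract reports.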

MAGUS: Multiple sequence Alignment using Graph clUStering

Bioinformatics ◽

10.1093/bioinformatics/btaa992 ◽

2020 ◽

Author(s):

Vladimir Smirnov ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Graph Clustering ◽

Divide And Conquer ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Full Dataset ◽

A New Technique

Abstract

Motivation: The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset.

Results: We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets.

Availability and implementation: MAGUS: https://github.com/vlasmirnov/MAGUS

Supplementary information: Supplementary data are available at Bioinformatics online.
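The divide-and-conquer strategy described in this abstract (split into disjoint subsets, align each subset, merge) can be outlined as a small skeleton. This is only a sketch of the control flow under simplifying assumptions: the subsets are cut by sorted name order rather than from a starting tree, and `align_subset` and `merge` are hypothetical callables standing in for a base aligner such as MAFFT and for a merger. It does not implement the Graph Clustering Merger.

```python
from typing import Callable, Dict, List

Alignment = Dict[str, str]  # sequence name -> gapped row


def split_disjoint(names: List[str], max_size: int) -> List[List[str]]:
    """Cut the name list into consecutive disjoint subsets of at most max_size."""
    if max_size < 1:
        raise ValueError("max_size must be positive")
    return [names[i:i + max_size] for i in range(0, len(names), max_size)]


def divide_and_conquer_msa(
    seqs: Dict[str, str],
    align_subset: Callable[[Dict[str, str]], Alignment],  # hypothetical base aligner
    merge: Callable[[List[Alignment]], Alignment],        # hypothetical merger
    max_size: int = 50,
) -> Alignment:
    """Align each disjoint subset separately, then merge the subset alignments."""
    # Placeholder decomposition: sorted name order instead of a guide tree.
    subsets = split_disjoint(sorted(seqs), max_size)
    # Each subset is aligned independently of the others.
    sub_alignments = [
        align_subset({name: seqs[name] for name in subset}) for subset in subsets
    ]
    # The merger combines the disjoint alignments into one on the full dataset.
    return merge(sub_alignments)
```

Passing the aligner and merger as parameters keeps the skeleton independent of any particular tool, so the merge step is the only part that would change between methods that share the same initial steps.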

Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé

Methods in Molecular Biology - Multiple Sequence Alignment Methods ◽

10.1007/978-1-62703-646-7_15 ◽

2013 ◽

pp. 219-244 ◽

Cited By ~ 14

Author(s):

Kevin Liu ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Multiple Sequence ◽

Tree Estimation

Deep Neural Network for Protein Contact Prediction by Weighting Sequences in a Multiple Sequence Alignment

10.1101/331926 ◽

2018 ◽

Author(s):

Hiroyuki Fukuda ◽

Kentaro Tomii

Keyword(s):

Neural Network ◽

Supervised Learning ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Deep Neural Network ◽

Multiple Sequence ◽

Contact Prediction ◽

Meta Learning ◽

Correlation Information

Abstract

Protein contact prediction is a crucially important step for protein structure prediction. To predict a contact, approaches of two types are used: evolutionary coupling analysis (ECA) and supervised learning. ECA uses a large multiple sequence alignment (MSA) of homologous sequences and extracts correlation information between residues. Supervised learning uses ECA results as input features and can produce higher accuracy. As described herein, we present a new approach to contact prediction which can both extract correlation information and predict contacts in a supervised manner directly from the MSA using a deep neural network (DNN). Using the DNN, we can obtain higher accuracy than with earlier ECA methods. Simultaneously, we can weight each sequence in the MSA to eliminate noise sequences automatically in a supervised way. It is expected that the combination of our method and other meta-learning methods can provide much higher accuracy of contact prediction.
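To make the idea of weighting each sequence in an MSA concrete, the sketch below computes per-column residue frequencies in which every row contributes in proportion to its weight, so a row with weight 0 is effectively removed as noise. In the approach described above the weights are learned by the network. Here they are simply supplied by the caller, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_column_frequencies(
    msa: Sequence[str], weights: Sequence[float]
) -> List[Dict[str, float]]:
    """Per-column residue frequencies where row n counts with weight weights[n]."""
    if len(msa) != len(weights):
        raise ValueError("need exactly one weight per sequence")
    if not msa:
        return []
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all rows of an alignment must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    freqs: List[Dict[str, float]] = []
    for col in range(width):
        counts: Dict[str, float] = defaultdict(float)
        # Each row adds its weight to the residue it carries in this column.
        for row, weight in zip(msa, weights):
            counts[row[col]] += weight
        # Normalise by the total weight so each column sums to 1.
        freqs.append({res: count / total for res, count in counts.items()})
    return freqs
```

With uniform weights this reduces to ordinary column frequencies. Pairwise (two-column) frequencies, which carry the correlation information, can be weighted in the same way.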

Multiple sequence alignment: a major challenge to large-scale phylogenetics

PLoS Currents ◽

10.1371/currents.rrn1198 ◽

2011 ◽

Vol 2 ◽

pp. RRN1198 ◽

Cited By ~ 12

Author(s):

Kevin Liu ◽

C. Randal Linder ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Multiple Sequence

Sequoya: multiobjective multiple sequence alignment in Python

Bioinformatics ◽

10.1093/bioinformatics/btaa257 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3892-3893

Author(s):

Antonio Benítez-Hidalgo ◽

Antonio J Nebro ◽

José F Aldana-Montes

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Software Tool ◽

Computing System ◽

Supplementary Information ◽

Optimization Approach ◽

Multiple Sequence ◽

Graphical Tool ◽

Optimal Alignments ◽

Python Programming

Abstract

Motivation: Multiple sequence alignment (MSA) consists of finding the optimal alignment of three or more biological sequences to identify highly conserved regions that may be the result of similarities and relationships between the sequences. MSA is an optimization problem with NP-hard complexity (non-deterministic polynomial-time hardness), because the time needed to find optimal alignments rises exponentially with the number of sequences and their length. Furthermore, the problem becomes multiobjective when more than one score is considered to assess the quality of an alignment, such as maximizing the percentage of totally conserved columns and minimizing the number of gaps. Our motivation is to provide a Python tool for solving MSA problems using evolutionary algorithms, a nonexact stochastic optimization approach that has proven to be effective for solving multiobjective problems.

Results: The software tool we have developed, called Sequoya, is written in the Python programming language, which offers a broad set of libraries for data analysis, visualization and parallelism. Thus, Sequoya offers a graphical tool to visualize the progress of the optimization in real time, the ability to guide the search toward a preferred region at run time, parallel support to distribute the computation among nodes in a distributed computing system, and a graphical component to assist in the analysis of the solutions found at the end of the optimization.

Availability and implementation: Sequoya can be freely obtained from the Python Package Index (via pip) or, alternatively, it can be downloaded from GitHub at https://github.com/benhid/Sequoya.

Supplementary information: Supplementary data are available at Bioinformatics online.
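The two example scores named in this abstract, the percentage of totally conserved columns (to maximize) and the number of gaps (to minimize), can be written down directly, together with the Pareto-dominance test that multiobjective optimizers typically use to compare candidate alignments. This is a plain-Python sketch with hypothetical function names. It does not use or reproduce the Sequoya API, and treating an all-gap column as not conserved is an assumption.

```python
from typing import Sequence, Tuple


def conserved_column_percentage(msa: Sequence[str]) -> float:
    """Percentage of columns in which every row carries the same non-gap residue."""
    if not msa or not msa[0]:
        return 0.0
    width = len(msa[0])
    conserved = sum(
        1
        for col in range(width)
        # Assumption: a column made only of gaps does not count as conserved.
        if msa[0][col] != "-" and all(row[col] == msa[0][col] for row in msa)
    )
    return 100.0 * conserved / width


def gap_count(msa: Sequence[str]) -> int:
    """Total number of gap characters in the alignment."""
    return sum(row.count("-") for row in msa)


def dominates(a: Tuple[float, int], b: Tuple[float, int]) -> bool:
    """True if a is no worse than b on both objectives and better on at least one.

    Each tuple is (conserved column percentage, gap count): the first value is
    maximized and the second is minimized.
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better
```

When neither of two alignments dominates the other, both stay in the set of trade-off solutions, which is why a multiobjective run returns a front of alignments rather than a single answer.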

Integrating Protein Secondary Structure Prediction and Multiple Sequence Alignment

Current Protein and Peptide Science ◽

10.2174/1389203043379675 ◽

2004 ◽

Vol 5 (4) ◽

pp. 249-266 ◽

Cited By ~ 35

Author(s):

V. Simossis ◽

J. Heringa

Keyword(s):

Secondary Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Protein Secondary Structure ◽

Protein Secondary Structure Prediction ◽

Multiple Sequence

Multi-GPU Approach for Large-Scale Multiple Sequence Alignment

10.1007/978-3-030-86653-2_41 ◽

2021 ◽

pp. 560-575

Author(s):

Rodrigo A. de O. Siqueira ◽

Marco A. Stefanes ◽

Luiz C. S. Rozante ◽

David C. Martins-Jr ◽

Jorge E. S. de Souza ◽

...

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Large Scale ◽

Multiple Sequence

A Novel Comparative Sequence Analysis Method for ncRNA Secondary Structure Prediction without Multiple Sequence Alignment

2008 Fourth International Conference on Natural Computation ◽

10.1109/icnc.2008.446 ◽

2008 ◽

Author(s):

Quan Zou ◽

Mao-Zu Guo ◽

Yang Liu ◽

Zhi-An Xing

Keyword(s):

Sequence Analysis ◽

Secondary Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Comparative Sequence Analysis ◽

Analysis Method ◽

Multiple Sequence ◽

Comparative Sequence

Protein multiple alignments: sequence-based versus structure-based programs

Bioinformatics ◽

10.1093/bioinformatics/btz236 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3970-3980 ◽

Cited By ~ 6

Author(s):

Mathilde Carpentier ◽

Jacques Chomilier

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Added Value ◽

Supplementary Information ◽

Supplementary Data ◽

Sequence Structure ◽

Multiple Sequence ◽

Sequence Identity ◽

Multiple Alignments ◽

Low Levels

Abstract

Motivation: Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature, but alignment programs based on structure, or on both sequence and structure, have not. In the present article we evaluate the added value provided by considering structures.

Results: We compared the multiple alignments resulting from 25 programs, based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs, regardless of the database. Two programs lead, MAMMOTH and MATRAS; nevertheless, MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP perform better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, and for residues that are buried or in regular secondary structures. Concerning gap management, sequence-based programs set fewer gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs.

Availability and implementation: All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Application of multiple sequence alignment profiles to improve protein secondary structure prediction

Proteins Structure Function and Bioinformatics ◽

10.1002/1097-0134(20000815)40:3<502::aid-prot170>3.0.co;2-q ◽

2000 ◽

Vol 40 (3) ◽

pp. 502-511 ◽

Cited By ~ 484

Author(s):

James A. Cuff ◽

Geoffrey J. Barton

Keyword(s):

Secondary Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

Protein Secondary Structure ◽

Protein Secondary Structure Prediction ◽

Multiple Sequence
