MAGUS: Multiple sequence Alignment using Graph clUStering

Large Scale ◽

Graph Clustering ◽

Divide And Conquer ◽

Sequence Alignments ◽

Multiple Sequence ◽

Full Dataset ◽

A New Technique

Abstract Motivation The estimation of large multiple sequence alignments (MSAs) is a basic bioinformatics challenge. Divide-and-conquer is a useful approach that has been shown to improve the scalability and accuracy of MSA estimation in established methods such as SATé and PASTA. In these divide-and-conquer strategies, a sequence dataset is divided into disjoint subsets, alignments are computed on the subsets using base MSA methods (e.g. MAFFT), and then merged together into an alignment on the full dataset. Results We present MAGUS, Multiple sequence Alignment using Graph clUStering, a new technique for computing large-scale alignments. MAGUS is similar to PASTA in that it uses nearly the same initial steps (starting tree, similar decomposition strategy, and MAFFT to compute subset alignments), but then merges the subset alignments using the Graph Clustering Merger, a new method for combining disjoint alignments that we present in this study. Our study, on a heterogeneous collection of biological and simulated datasets, shows that MAGUS produces improved accuracy and is faster than PASTA on large datasets, and matches it on smaller datasets. Availability and implementation MAGUS: https://github.com/vlasmirnov/MAGUS Supplementary information Supplementary data are available at Bioinformatics online.

DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins

Bioinformatics ◽

10.1093/bioinformatics/btz863 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2105-2112 ◽

Cited By ~ 14

Author(s):

Chengxin Zhang ◽

Wei Zheng ◽

S M Mortuza ◽

Yang Li ◽

Yang Zhang

Keyword(s):

Sequence Alignment ◽

Large Scale ◽

Secondary Structure Prediction ◽

Structure Identification ◽

Whole Genome ◽

Multiple Sequence ◽

Contact Prediction ◽

Homologous Sequences

Abstract Motivation The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of the DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation https://zhanglab.ccmb.med.umich.edu/DeepMSA/. Supplementary information Supplementary data are available at Bioinformatics online.

Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

AbstractThe estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations. multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology

Large-Scale Multiple Sequence Alignment and Tree Estimation Using SATé

Methods in Molecular Biology - Multiple Sequence Alignment Methods ◽

10.1007/978-1-62703-646-7_15 ◽

2013 ◽

pp. 219-244 ◽

Cited By ~ 14

Author(s):

Kevin Liu ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Large Scale ◽

Multiple Sequence ◽

Tree Estimation

MSAC: Compression of multiple sequence alignment files

10.1101/240341 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz ◽

Joanna Walczyszyn ◽

Agnieszka Debudaj-Grabysz

Keyword(s):

Sequence Alignment ◽

Compression Ratio ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Link Type ◽

Bioinformatics Databases ◽

Supplementary Material ◽

Burrows Wheeler Transform

AbstractMotivationBioinformatics databases grow rapidly and achieve values hardly to imagine a decade ago. Among numerous bioinformatics processes generating hundreds of GB is multiple sequence alignments of protein families. Its largest database, i.e., Pfam, consumes 40–230 GB, depending of the variant. Storage and transfer of such massive data has become a challenge.ResultsWe propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA, as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, i.e., gzip. Performed experiments resulted in an analysis of the influence of a protein family size on the compression ratio.AvailabilityMSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/[email protected] materialSupplementary data are available at the publisher Web site.

Multiple sequence alignment: a major challenge to large-scale phylogenetics

PLoS Currents ◽

10.1371/currents.rrn1198 ◽

2011 ◽

Vol 2 ◽

pp. RRN1198 ◽

Cited By ~ 12

Author(s):

Kevin Liu ◽

C. Randal Linder ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Large Scale ◽

Multiple Sequence

SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

Dna Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

AbstractPhylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.

Sequoya: multiobjective multiple sequence alignment in Python

Bioinformatics ◽

10.1093/bioinformatics/btaa257 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3892-3893

Author(s):

Antonio Benítez-Hidalgo ◽

Antonio J Nebro ◽

José F Aldana-Montes

Keyword(s):

Sequence Alignment ◽

Software Tool ◽

Computing System ◽

Optimization Approach ◽

Multiple Sequence ◽

Graphical Tool ◽

Optimal Alignments ◽

Python Programming

Abstract Motivation Multiple sequence alignment (MSA) consists of finding the optimal alignment of three or more biological sequences to identify highly conserved regions that may be the result of similarities and relationships between the sequences. MSA is an optimization problem with NP-hard complexity (non-deterministic polynomial-time hardness), because the time needed to find optimal alignments raises exponentially along with the number of sequences and their length. Furthermore, the problem becomes multiobjective when more than one score is considered to assess the quality of an alignment, such as maximizing the percentage of totally conserved columns and minimizing the number of gaps. Our motivation is to provide a Python tool for solving MSA problems using evolutionary algorithms, a nonexact stochastic optimization approach that has proven to be effective to solve multiobjective problems. Results The software tool we have developed, called Sequoya, is written in the Python programming language, which offers a broad set of libraries for data analysis, visualization and parallelism. Thus, Sequoya offers a graphical tool to visualize the progress of the optimization in real time, the ability to guide the search toward a preferred region in run-time, parallel support to distribute the computation among nodes in a distributed computing system, and a graphical component to assist in the analysis of the solutions found at the end of the optimization. Availability and implementation Sequoya can be freely obtained from the Python Package Index (pip) or, alternatively, it can be downloaded from Github at https://github.com/benhid/Sequoya. Supplementary information Supplementary data are available at Bioinformatics online.

Multi-GPU Approach for Large-Scale Multiple Sequence Alignment

10.1007/978-3-030-86653-2_41 ◽

2021 ◽

pp. 560-575

Author(s):

Rodrigo A. de O. Siqueira ◽

Marco A. Stefanes ◽

Luiz C. S. Rozante ◽

David C. Martins-Jr ◽

Jorge E. S. de Souza ◽

...

Keyword(s):

Sequence Alignment ◽

Large Scale ◽

Multiple Sequence

Protein multiple alignments: sequence-based versus structure-based programs

Bioinformatics ◽

10.1093/bioinformatics/btz236 ◽

2019 ◽

Vol 35 (20) ◽

pp. 3970-3980 ◽

Cited By ~ 6

Author(s):

Mathilde Carpentier ◽

Jacques Chomilier

Keyword(s):

Sequence Alignment ◽

Added Value ◽

Supplementary Data ◽

Sequence Structure ◽

Multiple Sequence ◽

Sequence Identity ◽

Multiple Alignments ◽

Low Levels

Abstract Motivation Multiple sequence alignment programs have proved to be very useful and have already been evaluated in the literature yet not alignment programs based on structure or both sequence and structure. In the present article we wish to evaluate the added value provided through considering structures. Results We compared the multiple alignments resulting from 25 programs either based on sequence, structure or both, to reference alignments deposited in five databases (BALIBASE 2 and 3, HOMSTRAD, OXBENCH and SISYPHUS). On the whole, the structure-based methods compute more reliable alignments than the sequence-based ones, and even than the sequence+structure-based programs whatever the databases. Two programs lead, MAMMOTH and MATRAS, nevertheless the performances of MUSTANG, MATT, 3DCOMB, TCOFFEE+TM_ALIGN and TCOFFEE+SAP are better for some alignments. The advantage of structure-based methods increases at low levels of sequence identity, or for residues in regular secondary structures or buried ones. Concerning gap management, sequence-based programs set less gaps than structure-based programs. Concerning the databases, the alignments of the manually built databases are more challenging for the programs. Availability and implementation All data and results presented in this study are available at: http://wwwabi.snv.jussieu.fr/people/mathilde/download/AliMulComp/. Supplementary information Supplementary data are available at Bioinformatics online.