Recursive MAGUS: scalable and accurate multiple sequence alignment

Multiple sequence alignment tools struggle to keep pace with rapidly growing sequence data, as few methods can handle large datasets while maintaining alignment accuracy. We recently introduced MAGUS, a new state-of-the-art method for aligning large numbers of sequences. In this paper, we present a comprehensive set of enhancements that allow MAGUS to align vastly larger datasets with greater speed. We compare MAGUS to other leading alignment methods on datasets of up to one million sequences. Our results demonstrate the advantages of MAGUS over other alignment software in both accuracy and speed. MAGUS is freely available in open-source form at https://github.com/vlasmirnov/MAGUS.

TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction

Molecular Biology and Evolution ◽ 10.1093/molbev/msu117 ◽ 2014 ◽ Vol 31 (6) ◽ pp. 1625-1637 ◽ Cited By ~ 113

Author(s): Jia-Ming Chang ◽ Paolo Di Tommaso ◽ Cedric Notredame

Keyword(s): Phylogenetic Tree ◽ Sequence Alignment ◽ Multiple Sequence Alignment ◽ Alignment Accuracy ◽ Tree Reconstruction ◽ Multiple Sequence ◽ Reliability Measure ◽ Phylogenetic Tree Reconstruction

Multiple Sequence Alignment Accuracy and Phylogenetic Inference

Systematic Biology ◽ 10.1080/10635150500541730 ◽ 2006 ◽ Vol 55 (2) ◽ pp. 314-328 ◽ Cited By ~ 146

Author(s): T Heath Ogden ◽ Michael S Rosenberg

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Phylogenetic Inference ◽ Alignment Accuracy ◽ Multiple Sequence

A Survey of the State-of-the-Art Parallel Multiple Sequence Alignment Algorithms on Multicore Systems

International Journal of Computer Applications ◽ 10.5120/ijca2018917658 ◽ 2018 ◽ Vol 182 (12) ◽ pp. 1-9 ◽ Cited By ~ 2

Author(s): Sara Shehab ◽ Sameh Abdulah ◽ Arabi E.

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ State Of The Art ◽ The State ◽ Multicore Systems ◽ Multiple Sequence ◽ Alignment Algorithms

ProgSIO-MSA: Progressive-based single iterative optimization framework for multiple sequence alignment using an effective scoring system

Journal of Bioinformatics and Computational Biology ◽ 10.1142/s0219720020500055 ◽ 2020 ◽ Vol 18 (02) ◽ pp. 2050005

Author(s): Sanjay Bankapur ◽ Nagamma Patil

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Scoring System ◽ State Of The Art ◽ Biological Sequences ◽ Alignment Quality ◽ Multiple Sequence ◽ Iterative Optimization ◽ Optimization Framework ◽ Proposed Model

Aligning more than two biological sequences is termed multiple sequence alignment (MSA). MSA is one of the primary activities in analyzing biological sequences, with potential applications in phylogenetics, homology markers, protein structure prediction, gene regulation, and drug discovery. The MSA problem is considered NP-complete. Moreover, with the advancement of Next-Generation Sequencing techniques, gene and protein databases are continually loaded with vast amounts of raw sequence data that are neither analyzed nor annotated. To analyze these growing volumes of raw sequences, there is a strong need for computationally efficient (polynomial-time) models that produce accurate alignments. In this study, a progressive-based alignment model named ProgSIO-MSA is proposed, which consists of an effective scoring system and an optimization framework. The proposed scoring system aligns sequences using a combination of two scoring strategies: Look Back Ahead, which scores a residue pair dynamically based on the status information of the previous position to improve the sum-of-pair score, and Position-Residue-Specific Dynamic Gap Penalty, which dynamically penalizes a gap using a mutation matrix on the basis of the residue and its position information. The proposed single iterative optimization (SIO) framework identifies and optimizes local optima traps to improve the alignment quality. The proposed model is evaluated against progressive-based state-of-the-art models on two benchmark datasets, BAliBASE and SABmark. The alignment quality (biological accuracy) of the proposed model is increased by 17.7% on the BAliBASE dataset. The proposed model's efficiency is compared with state-of-the-art models using time complexity as well as runtime analysis. Wilcoxon signed-rank test results indicate that the proposed model significantly outperforms progressive-based state-of-the-art models in alignment quality.
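
For readers unfamiliar with the sum-of-pair score mentioned in this abstract, the following is a minimal Python sketch of the generic measure. It is not the ProgSIO-MSA scoring system: the Look Back Ahead strategy and the dynamic gap penalty are not reproduced, and the function name `sum_of_pairs` and the fixed `match`, `mismatch`, and `gap` values are illustrative assumptions.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Return a simple sum-of-pairs score for a list of aligned rows.

    The scoring values are hypothetical placeholders; real tools typically
    use a substitution matrix and more elaborate gap penalties.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    # Walk the alignment column by column and score every pair of rows.
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # a gap paired with a gap contributes nothing
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


score = sum_of_pairs(["AC-", "ACG", "A-G"])
```

With these placeholder values, each column is scored independently, so the total is simply the sum over columns of all pairwise residue comparisons.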

MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

Briefings in Bioinformatics ◽ 10.1093/bib/bbx108 ◽ 2017 ◽ Vol 20 (4) ◽ pp. 1160-1166 ◽ Cited By ~ 989

Author(s): Kazutaka Katoh ◽ John Rozewicki ◽ Kazunori D Yamada

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Sequence Data ◽ Large Data ◽ Relevant Information ◽ Data Sets ◽ Online Service ◽ Multiple Sequence ◽ Biologically Relevant ◽ Sequencing Technologies

Abstract This article describes several features in the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools for experimental biologists to semiautomatically handle large data are becoming important. We are working on development of MAFFT toward these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.

An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology

10.1101/2020.11.24.396820 ◽ 2020

Author(s): Colin Young ◽ Sarah Meng ◽ Niema Moshiri

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Sequence Data ◽ Phylogenetic Inference ◽ The Other ◽ Computational Techniques ◽ Viral Sequence ◽ Sequence Alignments ◽ Multiple Sequence ◽ Branch Lengths

Abstract The use of computational techniques to analyze viral sequence data and ultimately inform public health intervention has become increasingly common in the realm of epidemiology. These methods typically attempt to make epidemiological inferences based on multiple sequence alignments and phylogenies estimated from the raw sequence data. Like all estimation techniques, multiple sequence alignment and phylogenetic inference tools are error-prone, and the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly-used workflows for conducting viral phylogenetic analyses on simulated viral sequence data modeling HIV, HCV, and Ebola, and we computed multiple methods of accuracy motivated by transmission clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was orders of magnitude faster than the other tools, and when the other tools were used to optimize branch lengths along a fixed topology provided by FastTree 2 (i.e., no tree search), the resulting phylogenies had accuracies that were indistinguishable from their original counterparts, but with a fraction of the runtime. Our results indicate that an ideal workflow for viral phylogenetic inference is to (1) use MAFFT to perform MSA, (2) use FastTree 2 under the GTR model with discrete gamma-distributed site-rate heterogeneity to quickly obtain a reasonable tree topology, and (3) use RAxML-NG to optimize branch lengths along the fixed FastTree 2 topology.

ViralMSA: Massively scalable reference-guided multiple sequence alignment of viral genomes

10.1101/2020.04.20.052068 ◽ 2020 ◽ Cited By ~ 1

Author(s): Niema Moshiri

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Genomic Sequence ◽ Sequence Data ◽ Software Project ◽ Multiple Sequence ◽ Viral Genomes ◽ Alignment Tool ◽ Multiple Sequence Alignment Tool ◽ Algorithmic Techniques

Abstract Motivation: In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment scale poorly with respect to the number of sequences. Results: ViralMSA is a user-friendly reference-guided multiple sequence alignment tool that leverages the algorithmic techniques of read mappers to enable the multiple sequence alignment of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. Availability: ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact: [email protected]

Multiple Sequence Alignment Averaging Improves Phylogeny Reconstruction

Systematic Biology ◽ 10.1093/sysbio/syy036 ◽ 2018 ◽ Vol 68 (1) ◽ pp. 117-130 ◽ Cited By ~ 9

Author(s): Haim Ashkenazy ◽ Itamar Sela ◽ Eli Levy Karin ◽ Giddy Landan ◽ Tal Pupko

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Sequence Data ◽ Phylogenetic Signal ◽ Large Set ◽ Multiple Sequence ◽ Extra Effort ◽ Alignment Algorithms ◽ Tree Inference ◽ Alignment Errors

Abstract The classic methodology of inferring a phylogenetic tree from sequence data is composed of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. Yet, inferred MSAs were shown to be inaccurate and alignment errors reduce tree inference accuracy. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal. In this work we explore an approach, in which instead of relying on a single MSA, we generate a large set of alternative MSAs and concatenate them into a single SuperMSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA computed by alignment algorithms. Using simulations, we demonstrate that this approach results, on average, in more accurate trees compared to 1) using an unfiltered MSA and 2) using a single MSA with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether or not the extra effort of computing a SuperMSA and inferring a tree from it is beneficial. Based on these assessments, we expect our methodology to be useful for many cases in which diverged sequences are analyzed. The option to generate such a SuperMSA is available at http://guidance.tau.ac.il.
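
The concatenation step described in this abstract can be illustrated with a short Python sketch. This is only an assumption-laden illustration, not the authors' implementation or the GUIDANCE server's code: the function name `concatenate_msas` and the dictionary-based MSA representation (sequence name mapped to its aligned row) are hypothetical, and how the alternative MSAs are generated is left out.

```python
def concatenate_msas(msas):
    """Concatenate alternative MSAs of the same sequences into one SuperMSA.

    Each MSA is a dict mapping a sequence name to its aligned row.
    This representation is an assumption made for illustration.
    """
    if not msas:
        raise ValueError("need at least one MSA")
    names = list(msas[0])
    for msa in msas:
        if set(msa) != set(names):
            raise ValueError("every MSA must contain the same sequence names")
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError("rows within an MSA must have equal length")
    # Join each sequence's rows end to end, keeping the MSA order fixed.
    return {name: "".join(msa[name] for msa in msas) for name in names}


msa_a = {"s1": "AC-G", "s2": "A-TG"}
msa_b = {"s1": "ACG-", "s2": "AT-G"}
super_msa = concatenate_msas([msa_a, msa_b])
```

Because every alternative alignment contributes its own columns, the resulting rows are as long as the sum of the individual alignment lengths, which is what allows columns absent from any single MSA to carry phylogenetic signal.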

ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes

Bioinformatics ◽ 10.1093/bioinformatics/btaa743 ◽ 2020

Author(s): Niema Moshiri

Keyword(s): Sequence Alignment ◽ Multiple Sequence Alignment ◽ Genomic Sequence ◽ Sequence Data ◽ Supplementary Information ◽ Software Project ◽ Multiple Sequence ◽ Viral Genomes ◽ Algorithmic Techniques ◽ User Friendly

Abstract Motivation: In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences. Results: ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome. Availability and implementation: ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
