ViralMSA: Massively scalable reference-guided multiple sequence alignment of viral genomes

Genomic Sequence ◽

Sequence Data ◽

Software Project ◽

Multiple Sequence ◽

Viral Genomes ◽

Alignment Tool ◽

Algorithmic Techniques

Abstract Motivation: In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment scale poorly with respect to the number of sequences. Results: ViralMSA is a user-friendly reference-guided multiple sequence alignment tool that leverages the algorithmic techniques of read mappers to enable the multiple sequence alignment of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. Availability: ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact: [email protected]

ViralMSA: massively scalable reference-guided multiple sequence alignment of viral genomes

Bioinformatics ◽

10.1093/bioinformatics/btaa743 ◽

2020 ◽

Author(s):

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Genomic Sequence ◽

Sequence Data ◽

Supplementary Information ◽

Software Project ◽

Multiple Sequence ◽

Viral Genomes ◽

Algorithmic Techniques ◽

User Friendly

Abstract Motivation: In molecular epidemiology, the identification of clusters of transmissions typically requires the alignment of viral genomic sequence data. However, existing methods of multiple sequence alignment (MSA) scale poorly with respect to the number of sequences. Results: ViralMSA is a user-friendly reference-guided MSA tool that leverages the algorithmic techniques of read mappers to enable the MSA of ultra-large viral genome datasets. It scales linearly with the number of sequences, and it is able to align tens of thousands of full viral genomes in seconds. However, alignments produced by ViralMSA omit insertions with respect to the reference genome. Availability and implementation: ViralMSA is freely available at https://github.com/niemasd/ViralMSA as an open-source software project. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
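The reference-guided idea in this abstract can be sketched as follows. This is a minimal illustration, not ViralMSA's actual code: it assumes each genome has already been mapped to the reference by a read mapper that reports a 0-based start position and a SAM-style CIGAR string, and the function name `project_onto_reference` is hypothetical. Each sequence is laid out in reference coordinates independently of the others, which is why the cost grows linearly with the number of sequences, and inserted bases are dropped, which matches the stated limitation that insertions relative to the reference are omitted.

```python
import re

# One CIGAR operation: a run length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def project_onto_reference(seq, cigar, ref_start, ref_len, gap="-"):
    """Lay out `seq` in reference coordinates, dropping inserted bases.

    Assumes `cigar` is consistent with `seq` and that the alignment
    fits inside a reference of length `ref_len`.
    """
    row = [gap] * ref_len
    ref_pos = ref_start  # 0-based position in the reference
    seq_pos = 0          # 0-based position in the query sequence
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "M=X":
            # Match or mismatch: copy bases into the reference columns.
            row[ref_pos:ref_pos + length] = seq[seq_pos:seq_pos + length]
            ref_pos += length
            seq_pos += length
        elif op in "IS":
            # Insertion or soft clip: consumes query only, so it is dropped.
            seq_pos += length
        elif op in "DN":
            # Deletion or skip: consumes reference only, columns stay gaps.
            ref_pos += length
        # H and P consume neither sequence.
    return "".join(row)


if __name__ == "__main__":
    reference = "ACGTACGT"
    # (name, sequence, CIGAR, 0-based start) are toy values for illustration.
    mappings = [
        ("s1", "ACGTACGT", "8M", 0),
        ("s2", "ACGACGT", "3M1D4M", 0),
        ("s3", "GTTTAC", "2M2I2M", 2),
    ]
    for name, seq, cigar, start in mappings:
        print(name, project_onto_reference(seq, cigar, start, len(reference)))
```

Because every row has exactly the reference length, stacking the rows gives a rectangular alignment without any all-against-all comparison.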

Constrained Multiple Sequence Alignment Tool Development and Its Application to RNase Family Alignment

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720003000095 ◽

2003 ◽

Vol 01 (02) ◽

pp. 267-287 ◽

Cited By ~ 33

Author(s):

Chuan Yi Tang ◽

Chin Lung Lu ◽

Margaret Dah-Tsyr Chang ◽

Yin-Te Tsai ◽

Yuh-Ju Sun ◽

...

Keyword(s):

Sequence Alignment ◽

Heuristic Algorithm ◽

Time Complexity ◽

Software System ◽

Tool Development ◽

Multiple Sequence ◽

Rna Molecules ◽

Alignment Tool ◽

Multiple Sequence Alignment Tool

In this paper, we design a heuristic algorithm for computing a constrained multiple sequence alignment (CMSA for short), which guarantees that the generated alignment satisfies user-specified constraints requiring particular residues to be aligned together. If the number of residues that must be aligned together is a constant α, then the time complexity of our CMSA algorithm for aligning K sequences is O(αKn^4), where n is the maximum sequence length. In addition, we have built a CMSA software system and run several experiments on RNase sequences, whose main function is to catalyze the degradation of RNA molecules. The resulting alignments illustrate the practicability of our method.

TM-Aligner: Multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy

Scientific Reports ◽

10.1038/s41598-017-13083-y ◽

2017 ◽

Vol 7 (1) ◽

Cited By ~ 7

Author(s):

Basharat Bhat ◽

Nazir A. Ganai ◽

Syed Mudasir Andrabi ◽

Riaz A. Shah ◽

Ashutosh Singh

Keyword(s):

Sequence Alignment ◽

Transmembrane Proteins ◽

Multiple Sequence ◽

Alignment Tool ◽

Improved Accuracy ◽

Reduced Time

PnpProbs: a better multiple sequence alignment tool by better handling of guide trees

BMC Bioinformatics ◽

10.1186/s12859-016-1121-7 ◽

2016 ◽

Vol 17 (S8) ◽

Author(s):

Yongtao Ye ◽

Tak-Wah Lam ◽

Hing-Fung Ting

Keyword(s):

Sequence Alignment ◽

Multiple Sequence ◽

Alignment Tool ◽

Guide Trees

Match-Box_server: a multiple sequence alignment tool placing emphasis on reliability

Bioinformatics ◽

10.1093/bioinformatics/13.3.249 ◽

1997 ◽

Vol 13 (3) ◽

pp. 249-256 ◽

Cited By ~ 9

Author(s):

Eric Depiereux ◽

Guy Baudoux ◽

Pascal Briffeuil ◽

Isabelle Reginster ◽

Xavier De Bolle ◽

...

Keyword(s):

Sequence Alignment ◽

Multiple Sequence ◽

Alignment Tool ◽

Multiple Sequence Alignment Tool

CUDA-Parttree: A Multiple Sequence Alignment Parallel Strategy in GPU

10.5753/wscad.2019.8662 ◽

2019 ◽

Author(s):

Caina Razzolini ◽

Alba Melo

Keyword(s):

Sequence Alignment ◽

Execution Time ◽

Distance Matrix ◽

Data Conversion ◽

Multiple Sequence ◽

Alignment Tool ◽

Matrix Calculation ◽

Parallel Strategy

In this paper, we propose and evaluate CUDA-Parttree, a parallel strategy that executes the first phase of the MAFFT Parttree Multiple Sequence Alignment tool (distance matrix calculation with 6mers) on GPU. When compared to Parttree, CUDA-Parttree obtained a speedup of 6.10x on the distance matrix calculation for the Cyclodex gly tran (50,280 sequences) set, reducing the execution time from 33.94s to 5.57s. Including data conversion and movement to/from the GPU, the speedup was 2.59x. With the sequence set Syn 100000 (100,000 sequences), a speedup of 4.46x was attained, reducing execution time from 209.54s to 47.00s.
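To make the "distance matrix calculation with 6mers" phase concrete, here is a small CPU sketch. It is an illustration under assumptions, not MAFFT Parttree's or CUDA-Parttree's code: the distance used (one minus the fraction of shared 6-mers, normalised by the shorter sequence's 6-mer count) is a simple stand-in for whatever formula those tools use, and all function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=6):
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=6):
    """Distance in [0, 1] based on shared k-mers (assumed formula)."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    # Multiset intersection keeps the minimum count of each shared k-mer.
    shared = sum((counts_a & counts_b).values())
    smaller = min(sum(counts_a.values()), sum(counts_b.values()))
    if smaller == 0:
        return 1.0  # at least one sequence is shorter than k
    return 1.0 - shared / smaller


def distance_matrix(seqs, k=6):
    """Symmetric all-against-all distance matrix as a list of lists."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = kmer_distance(seqs[i], seqs[j], k)
    return dist


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAC", "CCCCCCCCCC"]
    for row in distance_matrix(toy):
        print(row)
```

Each entry of the matrix depends only on its own pair of sequences, so the entries can be computed independently, which is the property a GPU strategy can exploit.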

ProPIP: a tool for progressive multiple sequence alignment with Poisson Indel Process

BMC Bioinformatics ◽

10.1186/s12859-021-04442-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Massimo Maiolo ◽

Lorenzo Gatti ◽

Diego Frei ◽

Tiziano Leidi ◽

Manuel Gil ◽

...

Keyword(s):

Sequence Alignment ◽

Source Code ◽

Evolutionary Model ◽

Multiple Sequence ◽

Insertions And Deletions ◽

Alignment Tool ◽

Progressive Multiple Sequence Alignment ◽

Biological Interpretation

Abstract Background: Current alignment tools typically lack an explicit model of indel evolution, leading to artificially short inferred alignments (i.e., over-alignment) due to inconsistencies between the indel history and the phylogeny relating the input sequences. Results: We present ProPIP, a new progressive multiple sequence alignment tool. The process of insertions and deletions is described using an explicit evolutionary model, the Poisson Indel Process (PIP). The method is based on dynamic programming and is implemented in a frequentist framework. The algorithm is implemented in C++ as a standalone program, and the source code can be compiled on Linux, macOS and Microsoft Windows platforms. The source code is freely available on GitHub at https://github.com/acg-team/ProPIP and is distributed under the terms of the GNU GPL v3 license. Conclusions: The use of an explicit indel evolution model makes it possible to avoid over-alignment, to infer gaps in a phylogenetically consistent way and to make inferences about the rates of insertions and deletions. Instead of arbitrary gap penalties, the parameters used by ProPIP are the insertion and deletion rates, which have a biological interpretation and are contextualized in a probabilistic environment. As a result, indel rate settings may be optimised in order to infer phylogenetically meaningful gap patterns.

Constrained multiple sequence alignment tool development and its application to RNase family alignment

Proceedings. IEEE Computer Society Bioinformatics Conference ◽

10.1109/csb.2002.1039336 ◽

2003 ◽

Cited By ~ 10

Author(s):

Chuan Yi Tang ◽

Chin Lung Lu ◽

M.D.-T. Chang ◽

Yin-Te Tsai ◽

Yuh-Ju Sun ◽

...

Keyword(s):

Sequence Alignment ◽

Tool Development ◽

Multiple Sequence ◽

Alignment Tool ◽

Multiple Sequence Alignment Tool

MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

Briefings in Bioinformatics ◽

10.1093/bib/bbx108 ◽

2017 ◽

Vol 20 (4) ◽

pp. 1160-1166 ◽

Cited By ~ 989

Author(s):

Kazutaka Katoh ◽

John Rozewicki ◽

Kazunori D Yamada

Keyword(s):

Sequence Alignment ◽

Sequence Data ◽

Large Data ◽

Relevant Information ◽

Data Sets ◽

Online Service ◽

Multiple Sequence ◽

Biologically Relevant ◽

Sequencing Technologies

Abstract This article describes several features of the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools that let experimental biologists handle large data semiautomatically are becoming important. We are developing MAFFT in these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.

An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology

10.1101/2020.11.24.396820 ◽

2020 ◽

Author(s):

Colin Young ◽

Sarah Meng ◽

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Sequence Data ◽

Phylogenetic Inference ◽

The Other ◽

Computational Techniques ◽

Viral Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Branch Lengths

Abstract The use of computational techniques to analyze viral sequence data and ultimately inform public health intervention has become increasingly common in the realm of epidemiology. These methods typically attempt to make epidemiological inferences based on multiple sequence alignments and phylogenies estimated from the raw sequence data. Like all estimation techniques, multiple sequence alignment and phylogenetic inference tools are error-prone, and the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly used workflows for conducting viral phylogenetic analyses on simulated viral sequence data modeling HIV, HCV, and Ebola, and we computed multiple measures of accuracy motivated by transmission clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was orders of magnitude faster than the other tools, and when the other tools were used to optimize branch lengths along a fixed topology provided by FastTree 2 (i.e., no tree search), the resulting phylogenies had accuracies that were indistinguishable from their original counterparts, but with a fraction of the runtime. Our results indicate that an ideal workflow for viral phylogenetic inference is to (1) use MAFFT to perform MSA, (2) use FastTree 2 under the GTR model with discrete gamma-distributed site-rate heterogeneity to quickly obtain a reasonable tree topology, and (3) use RAxML-NG to optimize branch lengths along the fixed FastTree 2 topology.
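The three-step workflow recommended at the end of this abstract could be scripted roughly as below. This is a sketch under assumptions rather than the authors' pipeline: the abstract does not give command lines, so the specific options (`mafft --auto`, `FastTree -nt -gtr -gamma`, `raxml-ng --evaluate ... --model GTR+G`), the output file names, and the helper names `build_workflow_commands` and `run_workflow` are illustrative choices that should be checked against each tool's documentation and installed version.

```python
import subprocess


def build_workflow_commands(fasta, prefix):
    """Return the three steps as argv lists plus optional stdout targets.

    File names derived from `prefix` are hypothetical conventions.
    """
    alignment = f"{prefix}.aln.fasta"
    topology = f"{prefix}.fasttree.nwk"
    return [
        # (1) MSA with MAFFT; the alignment is written to stdout.
        {"argv": ["mafft", "--auto", fasta], "stdout": alignment},
        # (2) Quick topology with FastTree 2 under GTR with gamma rates.
        {"argv": ["FastTree", "-nt", "-gtr", "-gamma", alignment],
         "stdout": topology},
        # (3) Branch-length optimization on the fixed topology (no tree search).
        {"argv": ["raxml-ng", "--evaluate", "--msa", alignment,
                  "--tree", topology, "--model", "GTR+G",
                  "--prefix", prefix],
         "stdout": None},
    ]


def run_workflow(steps):
    """Run each step in order, redirecting stdout to a file when requested."""
    for step in steps:
        if step["stdout"] is None:
            subprocess.run(step["argv"], check=True)
        else:
            with open(step["stdout"], "w") as out:
                subprocess.run(step["argv"], stdout=out, check=True)


if __name__ == "__main__":
    # Print the planned commands only; call run_workflow(...) to execute them
    # on a machine where the three tools are installed.
    for step in build_workflow_commands("sequences.fasta", "run1"):
        print(" ".join(step["argv"]), "->", step["stdout"])
```

Separating command construction from execution keeps the plan inspectable before anything is run, which is useful when the tools are not available locally.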