Kalign 3: multiple sequence alignment of large datasets

Keyword(s):

Original Design ◽

Multiple Sequence ◽

Design Specifications ◽

Multiple Data ◽

Large Numbers ◽

Embedding Strategy ◽

Guide Trees ◽

Motivation: Kalign is an efficient multiple sequence alignment (MSA) program capable of aligning thousands of protein or nucleotide sequences. However, current alignment problems involving large numbers of sequences are exceeding Kalign's original design specifications. Here we present a completely rewritten and updated version to meet current and future alignment challenges. Results: Kalign now uses a SIMD (single instruction, multiple data) accelerated version of the bit-parallel Gene Myers algorithm to estimate pairwise distances, and adopts a sequence embedding strategy and the bisecting K-means algorithm to rapidly construct guide trees for thousands of sequences. The new version maintains high alignment accuracy on both protein and nucleotide alignments and scales better than other MSA tools. Availability and implementation: The source code of Kalign and code to reproduce the results are found here: https://github.com/timolassmann/kalign.
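
To make the bit-parallel idea concrete, here is a minimal sketch of the Myers bit-vector edit distance in the formulation usually attributed to Hyyrö. It is plain Python on arbitrary-precision integers, not SIMD code, and it is not taken from the Kalign source; the function name `myers_edit_distance` is an assumption made for this illustration. Each column of the dynamic programming matrix is encoded as two bit vectors of vertical +1 and -1 differences, so one column is processed with a handful of word operations.

```python
def myers_edit_distance(pattern: str, text: str) -> int:
    """Global (Levenshtein) edit distance via bit-parallel column updates."""
    m = len(pattern)
    if m == 0:
        return len(text)

    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)

    mask = (1 << m) - 1      # keep only the m low bits (Python ints are unbounded)
    last = 1 << (m - 1)      # bit for the last pattern row
    vp, vn = mask, 0         # vertical +1 / -1 differences of the current column
    dist = m                 # distance between pattern and the empty text prefix

    for ch in text:
        x = peq.get(ch, 0)
        d0 = ((((x & vp) + vp) ^ vp) | x | vn) & mask
        hp = (vn | ~(d0 | vp)) & mask   # horizontal +1 differences
        hn = d0 & vp                    # horizontal -1 differences
        if hp & last:
            dist += 1
        elif hn & last:
            dist -= 1
        # Shift in a 1 because the top row grows by one per text character.
        hp = ((hp << 1) | 1) & mask
        hn = (hn << 1) & mask
        vp = (hn | ~(d0 | hp)) & mask
        vn = hp & d0
    return dist


if __name__ == "__main__":
    print(myers_edit_distance("kitten", "sitting"))
```

A production implementation would split long patterns into machine words and, as the abstract describes for Kalign, process several words at once with SIMD instructions; the big-integer version above only shows the recurrence.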

IFIP Advances in Information and Communication Technology - Computer Science and Its Applications ◽

Multiple Guide Trees in a Tabu Search Algorithm for the Multiple Sequence Alignment Problem

10.1007/978-3-319-19578-0_12 ◽

2015 ◽

pp. 141-152 ◽

Cited By ~ 1

Author(s):

Tahar Mehenni

Keyword(s):

Tabu Search ◽

Sequence Alignment ◽

Search Algorithm ◽

Tabu Search Algorithm ◽

Multiple Sequence ◽

Alignment Problem ◽

Improving multiple sequence alignment by using better guide trees

BMC Bioinformatics ◽

10.1186/1471-2105-16-s5-s4 ◽

2015 ◽

Vol 16 (Suppl 5) ◽

pp. S4 ◽

Cited By ~ 2

Author(s):

Qing Zhan ◽

Yongtao Ye ◽

Tak-Wah Lam ◽

Siu-Ming Yiu ◽

Yadong Wang ◽

...

Keyword(s):

Sequence Alignment ◽

Multiple Sequence ◽

Sequence embedding for fast construction of guide trees for multiple sequence alignment

Algorithms for Molecular Biology ◽

10.1186/1748-7188-5-21 ◽

2010 ◽

Vol 5 (1) ◽

pp. 21 ◽

Cited By ~ 67

Author(s):

Gordon Blackshields ◽

Fabian Sievers ◽

Weifeng Shi ◽

Andreas Wilm ◽

Desmond G Higgins

Keyword(s):

Sequence Alignment ◽

Multiple Sequence ◽

Data Distribution for Phylogenetic Inference with Site Repeats via Judicious Hypergraph Partitioning

10.1101/579318 ◽

2019 ◽

Author(s):

Ivo Baar ◽

Lukas Hübner ◽

Peter Oettig ◽

Adrian Zapletal ◽

Sebastian Schlag ◽

...

Keyword(s):

Sequence Alignment ◽

Likelihood Function ◽

Data Distribution ◽

Phylogenetic Inference ◽

Multiple Sequence ◽

Hypergraph Partitioning ◽

Partitioning Problem ◽

The Cost ◽

The so-called site repeats (SR) technique can be used to accelerate the widely-used phylogenetic likelihood function (PLF) by identifying identical patterns among multiple sequence alignment (MSA) sites, thereby omitting redundant calculations and saving memory. However, this complicates the optimal data distribution of MSA sites in parallel likelihood calculations, as the cost of computing the likelihood for individual sites strongly depends on the sites-to-cores assignment. We show that finding a ‘good’ sites-to-cores assignment can be modeled as a hypergraph partitioning problem, more specifically, a specific instance of the so-called judicious hypergraph partitioning problem. We initially develop, parallelize, and make available HyperPhylo, an efficient open-source implementation for this flavor of judicious partitioning where all vertices have the same degree. Using empirical MSA data, we then show that sites-to-core assignments computed via HyperPhylo are substantially better than those obtained via a previous naïve approach for phylogenetic data distribution under SRs.
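
A toy illustration of why the sites-to-cores assignment matters once identical patterns are computed only once. This is a deliberate simplification: it treats whole alignment columns as the repeating unit, whereas the SR technique detects repeats per subtree, and it is not HyperPhylo's partitioning algorithm. The helper names `columns` and `unique_work` are hypothetical.

```python
def columns(msa):
    """Return the alignment columns (sites) of an MSA given as equal-length rows."""
    return ["".join(col) for col in zip(*msa)]


def unique_work(msa, assignment, n_cores):
    """Count the distinct column patterns each core has to evaluate.

    assignment[k] is the core that site k is sent to. A core pays once per
    distinct pattern it holds, so repeated patterns on the same core are free.
    """
    seen = [set() for _ in range(n_cores)]
    for col, core in zip(columns(msa), assignment):
        seen[core].add(col)
    return [len(patterns) for patterns in seen]


if __name__ == "__main__":
    msa = ["AAGAAG",
           "CCTCCT",
           "GGAGGA"]                       # sites: ACG ACG GTA ACG ACG GTA
    cyclic = [k % 2 for k in range(6)]     # naive round-robin distribution
    grouped = [0, 0, 1, 0, 0, 1]           # keeps identical sites together
    print(unique_work(msa, cyclic, 2))     # per-core cost under round-robin
    print(unique_work(msa, grouped, 2))    # per-core cost when repeats share a core
```

In this toy case the round-robin assignment makes both cores evaluate both patterns, while the grouped assignment lets each core evaluate a single pattern. Minimizing the largest per-core cost is the kind of objective the abstract casts as judicious hypergraph partitioning.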

PnpProbs: a better multiple sequence alignment tool by better handling of guide trees

BMC Bioinformatics ◽

10.1186/s12859-016-1121-7 ◽

2016 ◽

Vol 17 (S8) ◽

Author(s):

Yongtao Ye ◽

Tak-Wah Lam ◽

Hing-Fung Ting

Keyword(s):

Sequence Alignment ◽

Multiple Sequence ◽

Alignment Tool ◽

Multiple Sequence Alignment Tool ◽

Performance Comparison of MPI-Based Parallel Multiple Sequence Alignment Algorithm Using Single and Multiple Guide Trees

2006 5th IEEE International Conference on Cognitive Informatics ◽

10.1109/coginf.2006.365552 ◽

2006 ◽

Author(s):

Siamak Rezaei ◽

Md. Maruf Monwar ◽

Joanne Bai

Keyword(s):

Sequence Alignment ◽

Performance Comparison ◽

Alignment Algorithm ◽

Multiple Sequence ◽

Sequence Alignment Algorithm ◽

A Comparative Analysis of Progressive Multiple Sequence Alignment Approaches using UPGMA and Neighbor Join Based Guide Trees

International Journal of Computer Science Engineering and Information Technology ◽

10.5121/ijcseit.2015.5401 ◽

2015 ◽

Vol 5 (3/4) ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Dega Ravi Kumar Yadav ◽

Gunes Ercal

Keyword(s):

Comparative Analysis ◽

Sequence Alignment ◽

Neighbor Join ◽

Multiple Sequence ◽

Progressive Multiple Sequence Alignment ◽

Recursive MAGUS: Scalable and accurate multiple sequence alignment

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008950 ◽

2021 ◽

Vol 17 (10) ◽

pp. e1008950

Author(s):

Vladimir Smirnov

Keyword(s):

Open Source ◽

Sequence Alignment ◽

State Of The Art ◽

Sequence Data ◽

Large Datasets ◽

Alignment Accuracy ◽

Multiple Sequence ◽

Large Numbers ◽

Source Form

Multiple sequence alignment tools struggle to keep pace with rapidly growing sequence data, as few methods can handle large datasets while maintaining alignment accuracy. We recently introduced MAGUS, a new state-of-the-art method for aligning large numbers of sequences. In this paper, we present a comprehensive set of enhancements that allow MAGUS to align vastly larger datasets with greater speed. We compare MAGUS to other leading alignment methods on datasets of up to one million sequences. Our results demonstrate the advantages of MAGUS over other alignment software in both accuracy and speed. MAGUS is freely available in open-source form at https://github.com/vlasmirnov/MAGUS.

Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark

International Journal of Applied Mathematics and Computer Science ◽

10.2478/v10006-009-0054-y ◽

2009 ◽

Vol 19 (4) ◽

pp. 675-678 ◽

Cited By ~ 5

Author(s):

Jacek Błażewicz ◽

Piotr Formanowicz ◽

Paweł Wojciechowski

Keyword(s):

Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Formal Definitions ◽

Accuracy Measures ◽

Total Column ◽

BAliBASE is one of the most widely used benchmarks for multiple sequence alignment programs. The accuracy of alignment methods is measured by bali_score, an application provided together with the database. The standard accuracy measures are the Sum of Pairs (SP) and the Total Column (TC). We have found that, for non-core block columns, results calculated by bali_score are different from those obtained on the basis of the formal definitions of the measures. We do not claim that one of these measures is better than the other, but they are definitely different. Such a situation can be the source of confusion when alignments obtained using various methods are compared. Therefore, we propose a new nomenclature for the measures of the quality of multiple sequence alignments to distinguish which one was actually calculated. Moreover, we have found that the occurrence of a gap in some column in the first sequence of the reference alignment causes column discarding.
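
For orientation, the sketch below computes SP and TC under one common reading of the formal definitions: SP is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, and TC is the fraction of reference columns reproduced exactly in the test alignment. It scores every column, does not restrict itself to core blocks, and does not reproduce the behavior of bali_score; the function names are assumptions made for this example, and the reference is assumed to contain at least one aligned residue pair.

```python
from itertools import combinations


def residue_columns(alignment):
    """Rewrite each column as a tuple of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for c in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """Set of residue pairs ((seq, idx), (seq, idx)) that share a column."""
    pairs = set()
    for col in cols:
        present = [(s, r) for s, r in enumerate(col) if r is not None]
        pairs.update(combinations(present, 2))
    return pairs


def sp_score(reference, test):
    ref_pairs = aligned_pairs(residue_columns(reference))
    test_pairs = aligned_pairs(residue_columns(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs)


def tc_score(reference, test):
    ref_cols = residue_columns(reference)
    test_cols = set(residue_columns(test))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]
    test = ["ACG-", "ACTG"]
    print(sp_score(reference, test), tc_score(reference, test))
```

Whether gap-containing columns count toward TC, and which columns are scored at all, are exactly the kinds of choices on which an implementation can diverge from the formal definitions, which is why naming the measure that was actually calculated matters.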

Sequence Alignment with Q-Learning Based on the Actor-Critic Model

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3433540 ◽

2021 ◽

Vol 20 (5) ◽

pp. 1-7

Author(s):

Yarong Li

Keyword(s):

Dynamic Programming ◽

Sequence Alignment ◽

Learning Process ◽

Dynamic Programming Method ◽

Multiple Sequence ◽

Q Learning ◽

Entire Process ◽

Alignment Problem ◽

Multiple sequence alignment methods refer to a series of algorithmic solutions for the alignment of evolutionarily related sequences while taking into account evolutionary events such as mutations, insertions, deletions, and rearrangements under certain conditions. In this article, we propose a method with Q-learning based on the Actor-Critic model for sequence alignment. We transform the sequence alignment problem into an agent's autonomous learning process. In this process, the reward of the possible next action taken is calculated, and the cumulative reward of the entire process is calculated. The results show that the method we propose is better than the genetic algorithm and the dynamic programming method.
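
To show what framing alignment as an agent's learning process can look like, here is a small tabular Q-learning sketch for pairwise alignment. It is not the Actor-Critic model of the article: the state is a position (i, j) in the two sequences, the three actions are the usual alignment moves, and the match/mismatch/gap rewards are assumed values chosen for illustration. A Needleman-Wunsch score function is included only as a reference against which to check the learned policy.

```python
import random

MATCH, MISMATCH, GAP = 1, -1, -1   # assumed scoring scheme


def valid_actions(a, b, state):
    """Moves available at (i, j): align both, consume a only, consume b only."""
    i, j = state
    acts = []
    if i < len(a) and j < len(b):
        acts.append("diag")
    if i < len(a):
        acts.append("down")
    if j < len(b):
        acts.append("right")
    return acts


def step(a, b, state, action):
    """Apply an action and return (next_state, reward)."""
    i, j = state
    if action == "diag":
        return (i + 1, j + 1), (MATCH if a[i] == b[j] else MISMATCH)
    if action == "down":
        return (i + 1, j), GAP
    return (i, j + 1), GAP


def q_learn(a, b, episodes=20000, alpha=1.0, gamma=1.0, seed=0):
    """Off-policy Q-learning with a uniformly random behavior policy."""
    rng = random.Random(seed)
    q, goal = {}, (len(a), len(b))
    for _ in range(episodes):
        state = (0, 0)
        while state != goal:
            action = rng.choice(valid_actions(a, b, state))
            nxt, reward = step(a, b, state, action)
            future = max((q.get((nxt, x), 0.0) for x in valid_actions(a, b, nxt)),
                         default=0.0)
            old = q.get((state, action), 0.0)
            q[(state, action)] = old + alpha * (reward + gamma * future - old)
            state = nxt
    return q


def greedy_score(a, b, q):
    """Cumulative reward of the greedy policy derived from q."""
    state, total, goal = (0, 0), 0, (len(a), len(b))
    while state != goal:
        action = max(valid_actions(a, b, state),
                     key=lambda x: q.get((state, x), float("-inf")))
        state, reward = step(a, b, state, action)
        total += reward
    return total


def nw_score(a, b):
    """Optimal global alignment score by dynamic programming (reference)."""
    prev = [j * GAP for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * GAP]
        for j in range(1, len(b) + 1):
            sub = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            cur.append(max(prev[j - 1] + sub, prev[j] + GAP, cur[j - 1] + GAP))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    a, b = "GATTA", "GCATA"
    q = q_learn(a, b)
    print(greedy_score(a, b, q), nw_score(a, b))
```

Because the environment is deterministic and the table is small, the learned greedy policy should match the dynamic programming optimum here; a tabular agent of this kind does not scale to long or multiple sequences, which is where function approximation such as an Actor-Critic model would come in.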