MergeAlign: improving multiple sequence alignment performance by dynamic reconstruction of consensus multiple sequence alignments

AbstractMotivationBioinformatics databases grow rapidly and achieve values hardly to imagine a decade ago. Among numerous bioinformatics processes generating hundreds of GB is multiple sequence alignments of protein families. Its largest database, i.e., Pfam, consumes 40–230 GB, depending of the variant. Storage and transfer of such massive data has become a challenge.ResultsWe propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA, as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, i.e., gzip. Performed experiments resulted in an analysis of the influence of a protein family size on the compression ratio.AvailabilityMSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/[email protected] materialSupplementary data are available at the publisher Web site.

Download Full-text

SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

Dna Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

AbstractPhylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.

Download Full-text

ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference

10.1101/2020.06.08.140384 ◽

2020 ◽

Cited By ~ 3

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Yuanning Li ◽

Xing-Xing Shen ◽

Antonis Rokas

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Inference ◽

Recent Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Time Saving ◽

Multiple Sequence Alignments

AbstractHighly divergent sites in multiple sequence alignments, which stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Trimming methods aim to remove these sites before phylogenetic inference, but recent analysis suggests that doing so can worsen inference. We introduce ClipKIT, a trimming method that instead aims to retain phylogenetically-informative sites; phylogenetic inference using ClipKIT-trimmed alignments is accurate, robust, and time-saving.

Download Full-text

fastMSA: Accelerating Multiple Sequence Alignment with Dense Retrieval on Protein Language

10.1101/2021.12.20.473431 ◽

2021 ◽

Author(s):

Liang Hong ◽

Siqi Sun ◽

Liangzhen Zheng ◽

Qingxiong Tan ◽

Yu Li

Keyword(s):

Protein Structure ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structure Prediction ◽

Structure And Function ◽

Sequence Alignments ◽

Protein Structure And Function ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

And Function

Evolutionarily related sequences provide information for the protein structure and function. Multiple sequence alignment, which includes homolog searching from large databases and sequence alignment, is efficient to dig out the information and assist protein structure and function prediction, whose efficiency has been proved by AlphaFold. Despite the existing tools for multiple sequence alignment, searching homologs from the entire UniProt is still time-consuming. Considering the success of AlphaFold, foreseeably, large- scale multiple sequence alignments against massive databases will be a trend in the field. It is very desirable to accelerate this step. Here, we propose a novel method, fastMSA, to improve the speed significantly. Our idea is orthogonal to all the previous accelerating methods. Taking advantage of the protein language model based on BERT, we propose a novel dual encoder architecture that can embed the protein sequences into a low-dimension space and filter the unrelated sequences efficiently before running BLAST. Extensive experimental results suggest that we can recall most of the homologs with a 34-fold speed-up. Moreover, our method is compatible with the downstream tasks, such as structure prediction using AlphaFold. Using multiple sequence alignments generated from our method, we have little performance compromise on the protein structure prediction with much less running time. fastMSA will effectively assist protein sequence, structure, and function analysis based on homologs and multiple sequence alignment.

Download Full-text

Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmark

International Journal of Applied Mathematics and Computer Science ◽

10.2478/v10006-009-0054-y ◽

2009 ◽

Vol 19 (4) ◽

pp. 675-678 ◽

Cited By ~ 5

Author(s):

Jacek Błażewicz ◽

Piotr Formanowicz ◽

Paweł Wojciechowski

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Formal Definitions ◽

Accuracy Measures ◽

Total Column ◽

Better Than

Some remarks on evaluating the quality of the multiple sequence alignment based on the BAliBASE benchmarkBAliBASE is one of the most widely used benchmarks for multiple sequence alignment programs. The accuracy of alignment methods is measured bybali_score—an application provided together with the database. The standard accuracy measures are the Sum of Pairs (SP) and the Total Column (TC). We have found that, for non-core block columns, results calculated bybali_scoreare different from those obtained on the basis of the formal definitions of the measures. We do not claim that one of these measures is better than the other, but they are definitely different. Such a situation can be the source of confusion when alignments obtained using various methods are compared. Therefore, we propose a new nomenclature for the measures of the quality of multiple sequence alignments to distinguish which one was actually calculated. Moreover, we have found that the occurrence of a gap in some column in the first sequence of the reference alignment causes column discarding.

Download Full-text

SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically

Protein and Peptide Letters ◽

10.2174/0929866526666190806143959 ◽

2020 ◽

Vol 27 (4) ◽

pp. 295-302 ◽

Cited By ~ 1

Author(s):

Qing Zhan ◽

Yilei Fu ◽

Qinghua Jiang ◽

Bo Liu ◽

Jiajie Peng ◽

...

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Middle Part ◽

The Other ◽

Initial Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Refinement Method ◽

Refinement Strategy

Background: Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms realign an alignment to refine it by dividing it into two groups horizontally and then realign the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. Objective: In this article, our motivation is to develop a novel refinement method based on splitting- splicing vertically. Method: Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle, realign the middle part alone, and splice the realigned middle parts with the other two initial pieces to obtain a refined alignment. In the realign procedure of our method, the aligner will only focus on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. Results: We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that given appropriate proportions to split the initial alignment, the average scores are increased comparably or slightly after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. Conclusion: The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.

Download Full-text

ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference

PLoS Biology ◽

10.1371/journal.pbio.3001007 ◽

2020 ◽

Vol 18 (12) ◽

pp. e3001007

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Yuanning Li ◽

Xing-Xing Shen ◽

Antonis Rokas

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Inference ◽

Sequence Alignments ◽

Multiple Sequence ◽

Time Saving ◽

Multiple Sequence Alignments ◽

Alternative Alignment ◽

Robust Framework

Highly divergent sites in multiple sequence alignments (MSAs), which can stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Thus, several different trimming strategies have been developed for identifying and removing these sites prior to phylogenetic inference. However, a recent study reported that doing so can worsen inference, underscoring the need for alternative alignment trimming strategies. Here, we introduce ClipKIT, an alignment trimming software that, rather than identifying and removing putatively phylogenetically uninformative sites, instead aims to identify and retain parsimony-informative sites, which are known to be phylogenetically informative. To test the efficacy of ClipKIT, we examined the accuracy and support of phylogenies inferred from 14 different alignment trimming strategies, including those implemented in ClipKIT, across nearly 140,000 alignments from a broad sampling of evolutionary histories. Phylogenies inferred from ClipKIT-trimmed alignments are accurate, robust, and time saving. Furthermore, ClipKIT consistently outperformed other trimming methods across diverse datasets, suggesting that strategies based on identifying and retaining parsimony-informative sites provide a robust framework for alignment trimming.

Download Full-text

Tailor-made multiple sequence alignments using the PRALINE 2 alignment toolkit

Bioinformatics ◽

10.1093/bioinformatics/btz572 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5315-5317 ◽

Cited By ~ 1

Author(s):

Maurits J J Dijkstra ◽

Atze J van der Ploeg ◽

K Anton Feenstra ◽

Wan J Fokkink ◽

Sanne Abeln ◽

...

Keyword(s):

Secondary Structure ◽

Open Source ◽

Sequence Alignment ◽

Open Source Software ◽

Multiple Sequence Alignment ◽

Multiple Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Dna Motifs ◽

Multiple Sequence Alignments

Abstract Summary PRALINE 2 is a toolkit for custom multiple sequence alignment workflows. It can be used to incorporate sequence annotations, such as secondary structure or (DNA) motifs, into the alignment scoring, as well as to customize many other aspects of a progressive multiple alignment workflow. Availability and implementation PRALINE 2 is implemented in Python and available as open source software on GitHub: https://github.com/ibivu/PRALINE/.

Download Full-text

Inferring an Original Sequence from Erroneous Copies: Two Approaches

Asia-Pacific Biotech News ◽

10.1142/s0219030303000284 ◽

2003 ◽

Vol 07 (03) ◽

pp. 107-114 ◽

Cited By ~ 3

Author(s):

Jonathan M. Keith ◽

Peter Adams ◽

Darryn Bryant ◽

Keith R. Mitchelson ◽

Duncan A. E. Cochran ◽

...

Keyword(s):

Sequence Alignment ◽

Dna Sequences ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Original Sequence ◽

Multiple Sequence Alignments ◽

Sequencing Errors ◽

The Cost ◽

New Algorithms

This paper considers the problem of inferring an original sequence from a number of erroneous copies. The problem arises in DNA sequencing, particularly in the context of emerging technologies that provide high throughput or other advantages at the cost of an increased number of errors. We describe and compare two approaches that have recently been developed by the authors. The first approach searches for a sequence known as a Steiner string; the second searches for the most probable original sequence with respect to a simple Bayesian model of sequencing errors. We present the results of extensive tests in which erroneous copies of real DNA sequences were simulated and the algorithms were used to infer the original sequences. The results are used to compare the two approaches to each other and to a third, more conventional, approach based on multiple sequence alignment. We find that the Bayesian approach is superior to the Steiner approach, which in turn is superior to the alignment approach. The two new algorithms can also be used to construct multiple sequence alignments. We show that the two methods produce alignments of approximately equal quality, and conclude that the Steiner approach is better for this purpose because it is faster. Both methods produce better alignments than a well-known multiple sequence alignment package, for the cases tested.

Download Full-text

Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.

Download Full-text