Progressive Multiple Sequence Alignment with the Poisson Indel Process

Abstract: Sequence alignment lies at the heart of many evolutionary and comparative genomics studies. However, the optimal alignment of multiple sequences is NP-hard, so exact algorithms become impractical for more than a few sequences. Thus, state-of-the-art alignment methods employ progressive heuristics, breaking the problem into a series of pairwise alignments guided by a phylogenetic tree. Changes between homologous characters are typically modelled by a continuous-time Markov substitution model. In contrast, the dynamics of insertions and deletions (indels) are not modelled explicitly, because computing the marginal likelihood under such models has exponential time complexity in the number of taxa. Recently, Bouchard-Côté and Jordan [PNAS (2012) 110(4):1160–1166] introduced a modification to a classical indel model, describing indel evolution on a phylogenetic tree as a Poisson process. The model, termed PIP, allows the joint marginal probability of a multiple sequence alignment and a tree to be computed in linear time. Here, we present a new dynamic programming algorithm to align two multiple sequence alignments by maximum likelihood in polynomial time under PIP, and apply it in a progressive algorithm. To our knowledge, this is the first progressive alignment method that uses a rigorous mathematical formulation of an evolutionary indel process and has polynomial time complexity.
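The pairwise step that progressive heuristics repeat can be illustrated with a classical global-alignment dynamic program. The sketch below is a plain Needleman-Wunsch score computation with fixed match, mismatch, and gap scores; it is not the maximum-likelihood algorithm under PIP described in the abstract, and the function name and scoring values are assumptions chosen for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Illustrative fixed-score scheme (an assumption), not a PIP likelihood.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "AGT"))
```

The table has (len(a)+1) x (len(b)+1) cells, so the cost is quadratic for two sequences. A progressive method applies a step of this kind at each internal node of the guide tree, aligning the two child alignments rather than two single sequences.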

Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. 
Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

Abstract: The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations. Keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology

Multiple Sequence Alignment Reveals Diversity among Eight African Bush Mango (Irvingia gabonensis Aubry-Lecomte ex O’Rorke) Cultivars

Journal of Experimental Agriculture International ◽

10.9734/jeai/2021/v43i130635 ◽

2021 ◽

pp. 91-96

Author(s):

U. G. Adebo ◽

J. O. Matthew

Keyword(s):

Sequence Analysis ◽

Phylogenetic Tree ◽

Genetic Resources ◽

Data Base ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Multiple Sequence ◽

Bush Mango ◽

Irvingia Gabonensis ◽

Cluster 2

Multiple sequence analysis is one of the most widely used models for estimating similarity among genotypes. In a bid to access useful information for the utilization of bush mango genetic resources, nucleotide sequences of eight bush mango (Irvingia gabonensis) cultivars were retrieved from the NCBI database and evaluated for diversity and similarity using a computational biology approach. The highest alignment score (26.18), depicting the highest similarity, was shared by two sequence pairs, BM07:BM58 and BM12:BM69, while the lowest score (19.43) was between BM01:BM13. The phylogenetic tree broadly divided the cultivars into four distinct groups: BM07 and BM58 (cluster 1); BM01 (cluster 2); BM15, BM13 and BM35 (cluster 3); and BM12 and BM69 (cluster 4). The sequences obtained from the analysis revealed only a few fully conserved regions, with the single nucleotides A and T being consistent throughout the evolution. Results obtained from this study indicate that the bush mango cultivars are divergent and can be useful genetic resources for bush mango improvement through breeding.

A Polynomial Time Solvable Formulation of Multiple Sequence Alignment

Lecture Notes in Computer Science - Research in Computational Molecular Biology ◽

10.1007/11415770_16 ◽

2005 ◽

pp. 204-216 ◽

Cited By ~ 1

Author(s):

Sing-Hoi Sze ◽

Yue Lu ◽

Qingwu Yang

Keyword(s):

Sequence Alignment ◽

Polynomial Time ◽

Multiple Sequence Alignment ◽

Multiple Sequence

TCS: A New Multiple Sequence Alignment Reliability Measure to Estimate Alignment Accuracy and Improve Phylogenetic Tree Reconstruction

Molecular Biology and Evolution ◽

10.1093/molbev/msu117 ◽

2014 ◽

Vol 31 (6) ◽

pp. 1625-1637 ◽

Cited By ~ 113

Author(s):

Jia-Ming Chang ◽

Paolo Di Tommaso ◽

Cedric Notredame

Keyword(s):

Phylogenetic Tree ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Alignment Accuracy ◽

Tree Reconstruction ◽

Multiple Sequence ◽

Reliability Measure ◽

Phylogenetic Tree Reconstruction

A Polynomial Time Solvable Formulation of Multiple Sequence Alignment

Journal of Computational Biology ◽

10.1089/cmb.2006.13.309 ◽

2006 ◽

Vol 13 (2) ◽

pp. 309-319 ◽

Cited By ~ 21

Author(s):

Sing-Hoi Sze ◽

Yue Lu ◽

Qingwu Yang

Keyword(s):

Sequence Alignment ◽

Polynomial Time ◽

Multiple Sequence Alignment ◽

Multiple Sequence

A Parallel Multiobjective Metaheuristic for Multiple Sequence Alignment

10.1101/103101 ◽

2017 ◽

Author(s):

Álvaro Rubio-Largo ◽

Leonardo Vanneschi ◽

Mauro Castelli ◽

Miguel A. Vega-Rodríguez

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Time Complexity ◽

Optimization Problem ◽

Optimal Alignment ◽

Multiple Sequence ◽

Parallel Metaheuristics ◽

Parallel Performance ◽

Parallel Version ◽

The Comparative Study

Abstract: The simultaneous alignment of three or more nucleotide or amino-acid sequences is known as Multiple Sequence Alignment (MSA), an NP-hard optimization problem. The time complexity of finding an optimal alignment rises exponentially as the number of sequences to align increases. In this work, we deal with a multiobjective version of the MSA problem where the goal is to simultaneously optimize the accuracy and conservation of the alignment. A parallel version of the Hybrid Multiobjective Memetic Metaheuristics for Multiple Sequence Alignment is proposed. To evaluate the parallel performance of our proposal, we selected a pool of datasets with different numbers of sequences (up to 1000 sequences) and studied its parallel performance against other well-known parallel metaheuristics published in the literature, such as MSAProbs, T-Coffee, Clustal Ω, and MAFFT. The comparative study reveals that our parallel aligner is around 25 times faster than the sequential version with 32 cores, obtaining a parallel efficiency of around 80%.
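The efficiency figure quoted above follows from the usual definition of parallel efficiency as speedup divided by the number of cores. The short sketch below applies that definition to the rounded numbers in the abstract (a speedup of about 25 on 32 cores); the function name is hypothetical, and the abstract's "around 80%" presumably rests on unrounded measurements.

```python
def parallel_efficiency(speedup, cores):
    """Parallel efficiency = speedup / number of cores (1.0 is ideal)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


if __name__ == "__main__":
    # Rounded values taken from the abstract: ~25x speedup on 32 cores.
    efficiency = parallel_efficiency(25, 32)
    print(f"efficiency = {efficiency:.3f}")
```

With these rounded inputs the ratio is 25/32 = 0.78125, which is consistent with the reported efficiency of around 80%.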

Multiple Sequence Alignment and Phylogenetic Tree Construction of Viral Protein 2 of Bluetongue virus

International Journal of Bioinformatics and Biological Science ◽

10.30954/2319-5169.01.2018.6 ◽

2018 ◽

Vol 6 (1) ◽

Keyword(s):

Phylogenetic Tree ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Viral Protein ◽

Bluetongue Virus ◽

Multiple Sequence ◽

Phylogenetic Tree Construction ◽

Tree Construction

MSAC: Compression of multiple sequence alignment files

10.1101/240341 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz ◽

Joanna Walczyszyn ◽

Agnieszka Debudaj-Grabysz

Keyword(s):

Sequence Alignment ◽

Compression Ratio ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Link Type ◽

Bioinformatics Databases ◽

Supplementary Material ◽

Burrows Wheeler Transform

Abstract: Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB are multiple sequence alignments of protein families. The largest such database, Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge. Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA as well as Stockholm files. It offers a compression ratio up to six times better than that of other commonly used compressors, i.e., gzip. The experiments also yielded an analysis of the influence of protein family size on the compression ratio. Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/[email protected] Supplementary material: Supplementary data are available at the publisher Web site.

SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

Dna Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract: Phylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.
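The quantity the abstract builds on, the Shannon entropy of an alignment column, can be computed as in the minimal sketch below. This shows only the per-column entropy, not SequenceBouncer's outlier-detection procedure; the function name, the toy alignment, and the choice to count the gap character as an ordinary symbol are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_entropies(alignment):
    """Return the Shannon entropy (in bits) of each column of an alignment.

    `alignment` is a list of equal-length strings. Gap characters are
    counted like any other symbol (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    entropies = []
    for column in zip(*alignment):
        total = len(column)
        counts = Counter(column)
        # H = sum over symbols of p * log2(1 / p)
        entropies.append(
            sum((n / total) * math.log2(total / n) for n in counts.values())
        )
    return entropies


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "ACTA", "AGTC"]  # hypothetical data
    print(column_entropies(toy_alignment))
```

A fully conserved column has entropy 0, and entropy grows as the symbols in a column become more evenly mixed, so a sequence that repeatedly disagrees with otherwise conserved columns is a candidate outlier.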
