An Improved Search Algorithm for Optimal Multiple-Sequence Alignment

Multiple sequence alignment (MSA) is a ubiquitous problem in computational biology. Although finding an optimal solution for an arbitrary number of sequences is NP-hard, the importance of the problem keeps researchers pushing the limits of exact algorithms. Since MSA can be cast as a classical path-finding problem, it is attracting a growing number of AI researchers interested in heuristic search algorithms as a challenge with actual practical relevance. In this paper, we first review two previous, complementary lines of research. Based on Hirschberg's algorithm, Dynamic Programming needs O(kN^(k-1)) space to store both the search frontier and the nodes needed to reconstruct the solution path, for k sequences of length N. Best-first search, on the other hand, has the advantage of using a heuristic to bound the search space that has to be explored. However, it must maintain all explored nodes up to the final solution in order to prevent the search from re-expanding them at higher cost. Earlier approaches to reducing the Closed list are either incompatible with pruning methods for the Open list, or must retain at least the boundary of the Closed list. In this article, we present an algorithm that attempts to combine their respective advantages: like A*, it uses a heuristic for pruning the search space, but it reduces both the maximum Open and Closed size to O(kN^(k-1)), as in Dynamic Programming. The underlying idea is to conduct a series of searches with successively increasing upper bounds, but using the DP ordering as the key for the Open priority queue. With a suitable choice of thresholds, a running time below four times that of A* can be expected in practice. In our experiments we show that our algorithm outperforms one of the currently most successful algorithms for optimal multiple sequence alignments, Partial Expansion A*, in both time and memory.
Moreover, we apply a refined heuristic based on optimal alignments not only of pairs of sequences, but of larger subsets. This idea is not new; however, to make it practically relevant we show that it is equally important to bound the heuristic computation appropriately, or the overhead can obliterate any possible gain. Furthermore, we discuss a number of improvements in time and space efficiency with regard to practical implementations. Our algorithm, used in conjunction with higher-dimensional heuristics, is able to calculate for the first time the optimal alignment for almost all of the problems in Reference 1 of the benchmark database BAliBASE.
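
The abstract's remark that MSA can be cast as a path-finding problem can be illustrated with a small sketch. It is not the paper's algorithm: it is plain uniform-cost search (A* with a zero heuristic) over the k-dimensional alignment lattice, with no Closed-list reduction and no DP ordering. The function names and the sum-of-pairs costs (mismatch 1, gap 2) are assumptions chosen for illustration.

```python
import heapq
from itertools import combinations, product


def column_cost(column, mismatch=1, gap=2):
    """Sum-of-pairs cost of one alignment column (costs are assumed values)."""
    cost = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue  # two gaps facing each other cost nothing
        if a == "-" or b == "-":
            cost += gap
        elif a != b:
            cost += mismatch
    return cost


def optimal_msa_cost(seqs):
    """Cheapest path from the lattice origin to the corner (len(s1), ..., len(sk))."""
    k = len(seqs)
    start = (0,) * k
    goal = tuple(len(s) for s in seqs)
    # Each non-zero 0/1 vector is an edge: 1 consumes a residue, 0 inserts a gap.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best = {start: 0}
    open_heap = [(0, start)]
    while open_heap:
        g, node = heapq.heappop(open_heap)
        if node == goal:
            return g
        if g > best[node]:
            continue  # stale queue entry; a cheaper path was found already
        for move in moves:
            nxt = tuple(i + d for i, d in zip(node, move))
            if any(i > n for i, n in zip(nxt, goal)):
                continue  # would step outside the lattice
            column = [s[i] if d else "-" for s, i, d in zip(seqs, node, move)]
            new_g = g + column_cost(column)
            if new_g < best.get(nxt, float("inf")):
                best[nxt] = new_g
                heapq.heappush(open_heap, (new_g, nxt))
    return None


if __name__ == "__main__":
    print(optimal_msa_cost(["ACG", "AG", "ACG"]))
```

Each node has up to 2^k - 1 successors and the `best` table grows with the number of lattice cells reached, which is the memory pressure that heuristic pruning and frontier-only storage are meant to relieve.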

Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1369.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 771-776

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Evolutionary Relationship ◽

Biological Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequencing Technologies ◽

Benchmark Database ◽

Execution Speed ◽

Protein Multiple Sequence Alignment

Protein multiple sequence alignment (MSA) is the process of aligning more than two protein sequences in order to establish an evolutionary relationship between them. In protein MSA, the biological sequences are aligned so as to identify maximum similarity. As sequencing technologies become more sophisticated, the volume of biological data generated is increasing at an enormous rate. This growth poses a challenge to existing MSA methods, because computational complexity increases and processing speed decreases as data volume grows. The accuracy of MSA is another critically important factor, as many bioinformatics inferences depend on the output of MSA. This paper reviews the existing state-of-the-art methods of protein MSA and compares four leading methods, namely MAFFT, Clustal Omega, MUSCLE and ProbCons, on speed and accuracy. BAliBASE version 3.0 (a repository of manually refined multiple sequence alignments) is used as the benchmark database, and the accuracy of the alignment methods is computed with two widely used criteria, the sum-of-pairs score (SPscore) and the total column score (TCscore). We also recorded the execution time of each method in order to compute its execution speed.
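
The two accuracy criteria named in this abstract can be sketched as follows. This is a simplified illustration, not the BAliBASE scoring program: it assumes that SPscore is the fraction of residue pairs aligned in the reference that are also aligned in the test alignment, that TCscore is the fraction of reference columns reproduced exactly, and that every column is scored (no restriction to core blocks). All function names are hypothetical.

```python
from itertools import combinations


def columns_as_indices(alignment):
    """Turn each column into a tuple of residue indices; None marks a gap."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for row, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Set of (row_i, residue_a, row_j, residue_b) for residues sharing a column."""
    pairs = set()
    for col in columns:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_score(test, ref):
    """Fraction of reference residue pairs that the test alignment also aligns."""
    ref_pairs = aligned_pairs(columns_as_indices(ref))
    test_pairs = aligned_pairs(columns_as_indices(test))
    return len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0


def tc_score(test, ref):
    """Fraction of reference columns that appear unchanged in the test alignment."""
    ref_cols = columns_as_indices(ref)
    test_cols = set(columns_as_indices(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols) if ref_cols else 0.0


if __name__ == "__main__":
    reference = ["AC-G", "ACTG"]
    candidate = ["ACG-", "ACTG"]
    print(sp_score(candidate, reference), tc_score(candidate, reference))
```

Both functions expect the rows of an alignment to have equal length and to use "-" as the gap character.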

Binary integer programming for Multiple Sequence Alignment

10.1101/854786 ◽

2019 ◽

Author(s):

S. Ali Lajevardy ◽

Mehrdad Kargari

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Optimal Solution ◽

Binary Integer Programming ◽

Pairwise Sequence Alignment ◽

Sequence Alignments ◽

Optimal Method ◽

Multiple Sequence ◽

The Past ◽

Genetic Algorithm Method

Advances in molecular biology over the past few decades have contributed to a rapid increase in the genome sequencing of various organisms; sequence alignment is usually considered the first step in understanding the molecular function of a sequence. An optimal alignment arranges two or more sequences so that the maximum number of identical or similar residues are matched. There are two types of sequence alignment: Pairwise Sequence Alignment (PSA) and Multiple Sequence Alignment (MSA). While dynamic programming (DP) provides an optimal method for PSA, it leads to much greater complexity when used for MSA, so MSA mainly relies on heuristic and approximation methods. This paper presents a mathematical model for MSA that can serve as a basis for obtaining an optimal solution in different ways. To obtain results, the model is implemented with a Genetic Algorithm method on the web.
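
The dynamic programming approach to pairwise alignment mentioned above can be sketched with the classic Needleman-Wunsch global alignment. This is a textbook illustration and not the binary integer programming model of the paper; the scores (match +1, mismatch -1, gap -1) are assumed values.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACG", "AG"))
```

The table has (n+1)(m+1) cells for two sequences; the analogous table for k sequences of length N has on the order of N^k cells, which is why exact DP is rarely applied directly to MSA.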

Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. 
Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.

EGSA: a new enhanced gravitational search algorithm to resolve multiple sequence alignment problem

International Journal of Intelligent Engineering Informatics ◽

10.1504/ijiei.2018.091007 ◽

2018 ◽

Vol 6 (1/2) ◽

pp. 204 ◽

Cited By ~ 1

Author(s):

Elamine Zemali ◽

Abdelmadjid Boukra

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Search Algorithm ◽

Gravitational Search Algorithm ◽

Multiple Sequence ◽

Alignment Problem ◽

Gravitational Search

Benchmarking Statistical Multiple Sequence Alignment

10.1101/304659 ◽

2018 ◽

Cited By ~ 1

Author(s):

Michael Nute ◽

Ehsan Saleh ◽

Tandy Warnow

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Structural Alignment ◽

Estimation Method ◽

Simulated Data ◽

Protein Sequences ◽

Data Sets ◽

Sequence Alignments ◽

Multiple Sequence ◽

Simulated Data Sets

The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations.

Author keywords: multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology

Multiple Guide Trees in a Tabu Search Algorithm for the Multiple Sequence Alignment Problem

IFIP Advances in Information and Communication Technology - Computer Science and Its Applications ◽

10.1007/978-3-319-19578-0_12 ◽

2015 ◽

pp. 141-152 ◽

Cited By ~ 1

Author(s):

Tahar Mehenni

Keyword(s):

Tabu Search ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Search Algorithm ◽

Tabu Search Algorithm ◽

Multiple Sequence ◽

Alignment Problem ◽

Guide Trees

Search Space Reduction Technique for Distributed Multiple Sequence Alignment

2009 Sixth IFIP International Conference on Network and Parallel Computing ◽

10.1109/npc.2009.43 ◽

2009 ◽

Cited By ~ 1

Author(s):

Manal Helal ◽

Lenore Mullin ◽

John Potter ◽

Vitali Sintchenko

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Search Space ◽

Reduction Technique ◽

Multiple Sequence ◽

Space Reduction ◽

Search Space Reduction

Resolving the multiple sequence alignment problem using biogeography-based optimization with multiple populations

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001550016x ◽

2015 ◽

Vol 13 (04) ◽

pp. 1550016 ◽

Cited By ~ 3

Author(s):

El-Amine Zemali ◽

Abdelmadjid Boukra

Keyword(s):

Sequence Alignment ◽

DNA Sequences ◽

Multiple Sequence Alignment ◽

Search Space ◽

New Method ◽

Average Score ◽

Solution Quality ◽

Multiple Sequence ◽

Multiple Populations ◽

Alignment Problem

Multiple sequence alignment (MSA) is one of the most challenging problems in bioinformatics; it involves discovering similarity among a set of protein or DNA sequences. This paper introduces a new method for the MSA problem called biogeography-based optimization with multiple populations (BBOMP). It is based on a recent metaheuristic inspired by the mathematics of biogeography, named biogeography-based optimization (BBO). To improve the exploration ability of BBO, we introduce a new concept that allows better exploration of the search space: manipulating multiple populations, each with its own parameters. These parameters are used to build progressive alignments, allowing more diversity. At each iteration, the best solution found is injected into each population. Moreover, to improve solution quality, six operators are defined. These operators are selected with a dynamic probability that changes according to each operator's efficiency. To test the performance of the proposed approach, we considered a set of datasets from BAliBASE 2.0 and compared it with many recent algorithms such as GAPAM, MSA-GA, QEAMSA and RBT-GA. The results show that the proposed approach achieves a better average score than these methods.

MSAC: Compression of multiple sequence alignment files

10.1101/240341 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz ◽

Joanna Walczyszyn ◽

Agnieszka Debudaj-Grabysz

Keyword(s):

Sequence Alignment ◽

Compression Ratio ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Link Type ◽

Bioinformatics Databases ◽

Supplementary Material ◽

Burrows Wheeler Transform

Motivation: Bioinformatics databases are growing rapidly and reaching sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes that generate hundreds of GB of data is the multiple sequence alignment of protein families. The largest database of such alignments, Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data have become a challenge.

Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors such as gzip. The experiments also analyse the influence of protein family size on the compression ratio.

Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/[email protected]

Supplementary material: Supplementary data are available at the publisher Web site.

SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

DNA Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Phylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.
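
The per-column Shannon entropy that this abstract refers to can be computed as in the sketch below. It illustrates only the entropy calculation and makes no claim about how SequenceBouncer itself scores or removes sequences. Treating the gap character as an ordinary symbol and using base-2 logarithms are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(alignment):
    """Entropy of every column; rows must be equal-length aligned strings."""
    return [column_entropy(col) for col in zip(*alignment)]


if __name__ == "__main__":
    rows = ["AAC", "AAG", "AAT", "AAC"]
    print(alignment_entropies(rows))
```

A fully conserved column has entropy 0, and entropy rises as the symbols in a column become more evenly mixed.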
