Benchmarking Statistical Multiple Sequence Alignment

Abstract: The estimation of multiple sequence alignments of protein sequences is a basic step in many bioinformatics pipelines, including protein structure prediction, protein family identification, and phylogeny estimation. Statistical co-estimation of alignments and trees under stochastic models of sequence evolution has long been considered the most rigorous technique for estimating alignments and trees, but little is known about the accuracy of such methods on biological benchmarks. We report the results of an extensive study evaluating the most popular protein alignment methods as well as the statistical co-estimation method BAli-Phy on 1192 protein data sets from established benchmarks as well as on 120 simulated data sets. Our study (which used more than 230 CPU years for the BAli-Phy analyses alone) shows that BAli-Phy is dramatically more accurate than the other alignment methods on the simulated data sets, but is among the least accurate on the biological benchmarks. There are several potential causes for this discordance, including model misspecification, errors in the reference alignments, and conflicts between structural alignment and evolutionary alignments; future research is needed to understand the most likely explanation for our observations.

Keyword(s): multiple sequence alignment, BAli-Phy, protein sequences, structural alignment, homology


SequenceBouncer: A method to remove outlier entries from a multiple sequence alignment

10.1101/2020.11.24.395459 ◽

2020 ◽

Author(s):

Cory D. Dunn

Keyword(s):

Nucleic Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Analyses ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

Dna Barcodes ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract: Phylogenetic analyses can take advantage of multiple sequence alignments as input. These alignments typically consist of homologous nucleic acid or protein sequences, and the inclusion of outlier or aberrant sequences can compromise downstream analyses. Here, I describe a program, SequenceBouncer, that uses the Shannon entropy values of alignment columns to identify outlier alignment sequences in a manner responsive to overall alignment context. I demonstrate the utility of this software using alignments of available mammalian mitochondrial genomes, bird cytochrome c oxidase-derived DNA barcodes, and COVID-19 sequences.
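The quantity this abstract relies on, the Shannon entropy of an alignment column, can be illustrated with a minimal Python sketch. It is not SequenceBouncer's code: the function name `column_entropies`, the choice to ignore gap characters, and the use of base-2 logarithms are all assumptions made for illustration, and the abstract does not say how the program turns column entropies into outlier calls.

```python
import math
from collections import Counter


def column_entropies(alignment, gap_chars="-."):
    """Return the Shannon entropy (in bits) of every alignment column.

    `alignment` is a list of equal-length gapped strings. Gap characters
    are ignored when counting residues (an assumption of this sketch).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    entropies = []
    for col in range(length):
        # Count the non-gap residues observed in this column.
        counts = Counter(
            seq[col].upper() for seq in alignment if seq[col] not in gap_chars
        )
        total = sum(counts.values())
        if total == 0:
            # A column made only of gaps carries no residue information.
            entropies.append(0.0)
            continue
        # H = sum over residues of p * log2(1 / p), with p = count / total.
        entropies.append(
            sum((n / total) * math.log2(total / n) for n in counts.values())
        )
    return entropies


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "AC-A", "ATGA"]
    for index, value in enumerate(column_entropies(toy_alignment)):
        print(f"column {index}: {value:.3f} bits")
```

A fully conserved column has an entropy of 0 bits, and a column split evenly between two residues has an entropy of 1 bit, so higher values mark more variable columns.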


Computational Analysis of Therapeutic Enzyme Uricase from Different Source Organisms

Current Proteomics ◽

10.2174/1570164616666190617165107 ◽

2020 ◽

Vol 17 (1) ◽

pp. 59-77

Author(s):

Anand Kumar Nelapati ◽

JagadeeshBabu PonnanEttiyappan

Keyword(s):

Uric Acid ◽

Amino Acid ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Sequences ◽

Amino Acid Sequences ◽

Amino Acid Residues ◽

Multiple Sequence ◽

Physiochemical Properties ◽

Pharmaceutical Industries

Background: Hyperuricemia and gout are conditions that result from the accumulation of uric acid in the blood and urine. Uric acid is the product of the purine metabolic pathway in humans. Uricase is a therapeutic enzyme that lowers the concentration of uric acid in serum and urine by converting it into the more soluble allantoin. Uricases are widely available from several sources, such as bacteria, fungi, yeast, plants and animals.

Objective: The present study aims to elucidate the structure and physiochemical properties of uricase by in silico analysis.

Methods: A total of sixty amino acid sequences of uricase from different sources were obtained from NCBI, and different analyses were performed, including Multiple Sequence Alignment (MSA), homology search, phylogenetic relation, motif search, domain architecture and physiochemical properties (pI, EC, Ai, and Ii).

Results: Multiple sequence alignment of all the selected protein sequences exhibited distinct differences between bacterial, fungal, plant and animal sources based on the position-specific existence of conserved amino acid residues. The maximum homology of all the selected protein sequences is between 51-388. Within each individual category, homology is between 16-337 for bacterial uricase, 14-339 for fungal uricase, 12-317 for plant uricase, and 37-361 for animal uricase. The phylogenetic tree constructed from the amino acid sequences disclosed clusters indicating that uricase is from different sources. The physiochemical features revealed that the uricase sequences contain between 300 and 338 amino acid residues, with a molecular weight of 33-39 kDa and a theoretical pI ranging from 4.95 to 8.88. The amino acid composition results showed that valine has a high average frequency of 8.79 percent compared to the other amino acids in all analyzed species.

Conclusion: In the field of bioinformatics, this work might be informative and serve as a stepping-stone for other researchers seeking an understanding of the physicochemical features, evolutionary history and structural motifs of uricase, which can be widely used in the biotechnological and pharmaceutical industries. Therefore, the proposed in silico analysis can be considered for protein engineering work, as well as for gout therapy.


Molecular homology and multiple-sequence alignment: an analysis of concepts and practice

Australian Systematic Botany ◽

10.1071/sb15001 ◽

2015 ◽

Vol 28 (1) ◽

pp. 46 ◽

Cited By ~ 20

Author(s):

David A. Morrison ◽

Matthew J. Morgan ◽

Scot A. Kelchner

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Molecular Data ◽

Simple Relationship ◽

Sequence Alignments ◽

Multiple Sequence ◽

Molecular Change ◽

Nucleotide Homology ◽

Tree Building ◽

Molecular Homology

Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homology is rather vaguely defined and understood, despite its importance in the molecular age. Indeed, homology has rarely been evaluated with respect to nucleotide sequence alignments, in spite of the fact that nucleotides are the only data that directly represent genotype. All other molecular data represent phenotype, just as do morphology and anatomy. Thus, efforts to improve sequence alignment for phylogenetic purposes should involve a more refined use of the homology concept at a molecular level. To this end, we present examples of molecular-data levels at which homology might be considered, and arrange them in a hierarchy. The concept that we propose has many levels, which link directly to the developmental and morphological components of homology. Of note, there is no simple relationship between gene homology and nucleotide homology. We also propose terminology with which to better describe and discuss molecular homology at these levels. Our over-arching conceptual framework is then used to shed light on the multitude of automated procedures that have been created for multiple-sequence alignment. Sequence alignment needs to be based on aligning homologous nucleotides, without necessary reference to homology at any other level of the hierarchy. In particular, inference of nucleotide homology involves deriving a plausible scenario for molecular change among the set of sequences. 
Our clarifications should allow the development of a procedure that specifically addresses homology, which is required when performing alignment for phylogenetic purposes, but which does not yet exist.


Influence of Parameters in Multiple Sequence Alignment Methods for Protein Sequences

Advances in Intelligent Systems and Computing - Progress in Computing, Analytics and Networking ◽

10.1007/978-981-10-7871-2_18 ◽

2018 ◽

pp. 183-191

Author(s):

P. Manikandan ◽

D. Ramyachitra

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Sequences ◽

Multiple Sequence


A benchmark for evaluation of phylogeny reconstruction programs

10.7287/peerj.preprints.2628v1 ◽

2016 ◽

Author(s):

Sergei Spirin

Keyword(s):

Maximum Likelihood ◽

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Protein Sequences ◽

Relative Accuracy ◽

Phylogeny Reconstruction ◽

Multiple Sequence ◽

Natural Protein ◽

Large Sets ◽

The Moment

There are many algorithms and programs for reconstructing the phylogeny of a set of proteins based on a multiple sequence alignment. Many programs allow users to choose a number of parameters, for example, a model for maximum likelihood programs. Different programs and different parameters often produce different results. However, at the moment, all published benchmarks for evaluating the relative accuracy of programs or of different parameter choices are based on simulated sequences. The aim of the present work is to create a benchmark that allows a comparison of phylogenetic programs on large sets of alignments of natural protein sequences.


MSAC: Compression of multiple sequence alignment files

10.1101/240341 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sebastian Deorowicz ◽

Joanna Walczyszyn ◽

Agnieszka Debudaj-Grabysz

Keyword(s):

Sequence Alignment ◽

Compression Ratio ◽

Multiple Sequence Alignment ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Link Type ◽

Bioinformatics Databases ◽

Supplementary Material ◽

Burrows Wheeler Transform

Abstract

Motivation: Bioinformatics databases grow rapidly and reach sizes that were hard to imagine a decade ago. Among the numerous bioinformatics processes that generate hundreds of GB of data is the multiple sequence alignment of protein families. The largest database of such alignments, i.e., Pfam, consumes 40–230 GB, depending on the variant. Storage and transfer of such massive data has become a challenge.

Results: We propose a novel compression algorithm, MSAC (Multiple Sequence Alignment Compressor), designed especially for aligned data. It is based on a generalisation of the positional Burrows–Wheeler transform for non-binary alphabets. MSAC handles FASTA, as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, i.e., gzip. The experiments performed also yielded an analysis of the influence of protein family size on the compression ratio.

Availability: MSAC is available for free at https://github.com/refresh-bio/msac and http://sun.aei.polsl.pl/REFRESH/[email protected]

Supplementary material: Supplementary data are available at the publisher Web site.


MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization

Briefings in Bioinformatics ◽

10.1093/bib/bbx108 ◽

2017 ◽

Vol 20 (4) ◽

pp. 1160-1166 ◽

Cited By ~ 989

Author(s):

Kazutaka Katoh ◽

John Rozewicki ◽

Kazunori D Yamada

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Data ◽

Large Data ◽

Relevant Information ◽

Data Sets ◽

Online Service ◽

Multiple Sequence ◽

Biologically Relevant ◽

Sequencing Technologies

Abstract: This article describes several features in the MAFFT online service for multiple sequence alignment (MSA). As a result of recent advances in sequencing technologies, huge numbers of biological sequences are available and the need for MSAs with large numbers of sequences is increasing. To extract biologically relevant information from such data, sophistication of algorithms is necessary but not sufficient. Intuitive and interactive tools for experimental biologists to semiautomatically handle large data are becoming important. We are working on development of MAFFT toward these two directions. Here, we explain (i) the Web interface for recently developed options for large data and (ii) interactive usage to refine sequence data sets and MSAs.


An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology

10.1101/2020.11.24.396820 ◽

2020 ◽

Author(s):

Colin Young ◽

Sarah Meng ◽

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Data ◽

Phylogenetic Inference ◽

The Other ◽

Computational Techniques ◽

Viral Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Branch Lengths

Abstract: The use of computational techniques to analyze viral sequence data and ultimately inform public health intervention has become increasingly common in the realm of epidemiology. These methods typically attempt to make epidemiological inferences based on multiple sequence alignments and phylogenies estimated from the raw sequence data. Like all estimation techniques, multiple sequence alignment and phylogenetic inference tools are error-prone, and the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly-used workflows for conducting viral phylogenetic analyses on simulated viral sequence data modeling HIV, HCV, and Ebola, and we computed multiple measures of accuracy motivated by transmission clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was orders of magnitude faster than the other tools, and when the other tools were used to optimize branch lengths along a fixed topology provided by FastTree 2 (i.e., no tree search), the resulting phylogenies had accuracies that were indistinguishable from their original counterparts, but with a fraction of the runtime. Our results indicate that an ideal workflow for viral phylogenetic inference is to (1) use MAFFT to perform MSA, (2) use FastTree 2 under the GTR model with discrete gamma-distributed site-rate heterogeneity to quickly obtain a reasonable tree topology, and (3) use RAxML-NG to optimize branch lengths along the fixed FastTree 2 topology.


Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.a1369.109119 ◽

2019 ◽

Vol 9 (1) ◽

pp. 771-776

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Evolutionary Relationship ◽

Biological Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequencing Technologies ◽

Benchmark Database ◽

Execution Speed ◽

Protein Multiple Sequence Alignment

Protein multiple sequence alignment (MSA) is a process that aligns more than two protein sequences in order to establish an evolutionary relationship between them. As part of protein MSA, the biological sequences are aligned so as to identify maximum similarities. Over time, sequencing technologies are becoming more sophisticated, and hence the volume of biological data generated is increasing at an enormous rate. This increase in data volume poses a challenge to the existing methods for performing effective MSA: as the data volume grows, the computational complexity also increases and the processing speed decreases. The accuracy of MSA is another critically important factor, as many bioinformatics inferences depend on the output of MSA. This paper elaborates on the existing state-of-the-art methods of protein MSA and compares four leading methods, namely MAFFT, Clustal Omega, MUSCLE and ProbCons, based on their speed and accuracy. BAliBASE version 3.0 (a repository of manually refined multiple sequence alignments) is used as the benchmark database, and the accuracy of the alignment methods is computed through two widely used criteria, the sum-of-pairs score (SPscore) and the total column score (TCscore). We also recorded the execution time of each method in order to compute its execution speed.
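The two accuracy criteria named in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the scoring program used in the paper: the function name `sp_and_tc` and its helpers are hypothetical, every reference column is scored (benchmark tools may restrict scoring to annotated core regions), and both alignments are assumed to list the same sequences in the same order with no all-gap columns.

```python
from itertools import combinations


def _columns(alignment, gap="-"):
    """Describe each column as a tuple of per-sequence residue indices.

    A gap is recorded as None. Residue indices count the non-gap
    characters seen so far in each sequence, so they do not depend on
    where gaps were placed.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    seen = [0] * len(alignment)
    columns = []
    for c in range(length):
        column = []
        for s, row in enumerate(alignment):
            if row[c] == gap:
                column.append(None)
            else:
                column.append(seen[s])
                seen[s] += 1
        columns.append(tuple(column))
    return columns


def _aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for column in columns:
        for (i, a), (j, b) in combinations(enumerate(column), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_and_tc(test, reference, gap="-"):
    """Return (SP score, TC score) of `test` measured against `reference`."""
    test_columns = _columns(test, gap)
    ref_columns = _columns(reference, gap)

    # SP: fraction of reference residue pairs reproduced by the test alignment.
    ref_pairs = _aligned_pairs(ref_columns)
    test_pairs = _aligned_pairs(test_columns)
    sp = len(ref_pairs & test_pairs) / len(ref_pairs) if ref_pairs else 0.0

    # TC: fraction of reference columns reproduced exactly.
    test_set = set(test_columns)
    matched = sum(1 for column in ref_columns if column in test_set)
    tc = matched / len(ref_columns) if ref_columns else 0.0
    return sp, tc


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(sp_and_tc(test, reference))
```

For the toy pair above, the test alignment recovers three of the four reference residue pairs and three of the five reference columns, giving SP = 0.75 and TC = 0.6. Scoring a reference against itself gives 1.0 for both.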


ClipKIT: a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference

10.1101/2020.06.08.140384 ◽

2020 ◽

Cited By ~ 3

Author(s):

Jacob L. Steenwyk ◽

Thomas J. Buida ◽

Yuanning Li ◽

Xing-Xing Shen ◽

Antonis Rokas

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Phylogenetic Inference ◽

Recent Analysis ◽

Sequence Alignments ◽

Multiple Sequence ◽

Time Saving ◽

Multiple Sequence Alignments

Abstract: Highly divergent sites in multiple sequence alignments, which stem from erroneous inference of homology and saturation of substitutions, are thought to negatively impact phylogenetic inference. Trimming methods aim to remove these sites before phylogenetic inference, but recent analysis suggests that doing so can worsen inference. We introduce ClipKIT, a trimming method that instead aims to retain phylogenetically-informative sites; phylogenetic inference using ClipKIT-trimmed alignments is accurate, robust, and time-saving.
