MODELING AMINO ACID SUBSTITUTIONS FOR WHOLE GENOMES

Modeling the amino acid substitution process is a core task in bioinformatics. Advanced sequencing technologies have generated huge datasets, including whole genomes from various species. Estimating amino acid substitution models from whole genome datasets provides unprecedented opportunities to accurately investigate relationships among species. In this paper, we review state-of-the-art computational methods for estimating amino acid substitution models from large datasets. We also describe a comprehensive, practical pipeline for estimating amino acid models from whole genome datasets. Finally, we apply amino acid substitution models to build phylogenomic trees from bird and plant genome datasets. We compare our newly reconstructed phylogenomic trees with published ones and discuss new findings.

Estimating Amino Acid Substitution Models: A Comparison of Dayhoff's Estimator, the Resolvent Approach and a Maximum Likelihood Method

Molecular Biology and Evolution, 2002, Vol 19 (1), pp. 8-13

DOI: 10.1093/oxfordjournals.molbev.a003985

Cited By ~ 89

Author(s): Tobias Müller, Rainer Spang, Martin Vingron

Keyword(s): Amino Acid, Maximum Likelihood, Amino Acid Substitution, Maximum Likelihood Method, Likelihood Method, Substitution Models, Resolvent Approach

MtOrt: An empirical mitochondrial amino acid substitution model for evolutionary studies of Orthoptera insects

DOI: 10.21203/rs.2.20989/v2

2020

Author(s): Huihui Chang, Yimeng Nie, Nan Zhang, Xue Zhang, Huimin Sun, ...

Keyword(s): Amino Acid, Amino Acid Substitution, Mitochondrial Protein, Protein Sequences, Mitochondrial Genomes, Substitution Model, New Model, Substitution Models, Amino Acid Substitution Model, Complete Mitochondrial Genomes

Abstract. Background: Amino acid substitution models play an important role in inferring phylogenies from mitochondrial proteins. Although different amino acid substitution models have been proposed, only a few were estimated from mitochondrial protein sequences for specific taxa, such as the mtArt model for Arthropoda. The increasing amount of mitochondrial genome data from a broad range of Orthoptera taxa provides an opportunity to estimate an Orthoptera-specific empirical mitochondrial amino acid model. Results: We sequenced complete mitochondrial genomes of 54 Orthoptera species and then estimated an amino acid substitution model (named mtOrt) by the maximum likelihood method, based on the 283 complete mitochondrial genomes currently available. The results indicated that there are obvious differences between mtOrt and the existing models, and that the new model fits the Orthoptera mitochondrial protein datasets better. Moreover, the topologies of trees constructed using mtOrt and existing models are frequently different, so mtOrt affects both the likelihood and the tree topologies. Comparisons between these topologies show that the new model outperforms the existing models in inferring phylogenies from Orthoptera mitochondrial protein data. Conclusions: The new mitochondrial amino acid substitution model for Orthoptera shows obvious differences from the existing models and outperforms them in inferring phylogenies from Orthoptera mitochondrial protein sequences.

A Fast and Efficient Method for Estimating Amino Acid Substitution Models

2011 Third International Conference on Knowledge and Systems Engineering, 2011

DOI: 10.1109/kse.2011.21

Author(s): Van Dat Le, Cao Cuong Dang, Le Si Quang, Le Sy Vinh

Keyword(s): Amino Acid, Amino Acid Substitution, Efficient Method, Substitution Models

pQMaker: empirically estimating amino acid substitution models in a parallel environment

2020 12th International Conference on Knowledge and Systems Engineering (KSE), 2020

DOI: 10.1109/kse50997.2020.9287569

Author(s): Nguyen Duc Canh, Cuong Cao Dang, Le Sy Vinh, Bui Quang Minh, Diep Thi Hoang

Keyword(s): Amino Acid, Amino Acid Substitution, Substitution Models, Parallel Environment

Protein Type Specific Amino Acid Substitution Models for Influenza Viruses

2011 Third International Conference on Knowledge and Systems Engineering, 2011

DOI: 10.1109/kse.2011.23

Author(s): Nguyen Van Sau, Dang Cao Cuong, Le Si Quang, Le Sy Vinh

Keyword(s): Amino Acid, Amino Acid Substitution, Influenza Viruses, Specific Amino Acid, Substitution Models, Specific Amino Acid Substitution

QMaker: Fast and accurate method to estimate empirical models of protein evolution

DOI: 10.1101/2020.02.20.958819

2020

Cited By ~ 1

Author(s): Bui Quang Minh, Cuong Cao Dang, Le Sy Vinh, Robert Lanfear

Keyword(s): Amino Acid, Amino Acid Substitution, Search Algorithm, Phylogenetic Analyses, Accurate Method, Sequence Alignments, Multiple Sequence, Multiple Sequence Alignments, Substitution Models, Protein Alignments

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection step for handling model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores, so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database, as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.
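To make the term "general time-reversible Q matrix" concrete, the sketch below builds one from symmetric exchangeabilities and stationary frequencies using the standard parameterization Q[i][j] = r[i][j] * pi[j]. It is not QMaker's estimation procedure: the function names are hypothetical and the numbers are random placeholders, not estimated values.

```python
import random


def build_gtr_q(exch, freqs):
    """Build a reversible rate matrix: Q[i][j] = exch[i][j] * freqs[j], i != j."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exch[i][j] * freqs[j]
        # The diagonal is still 0.0 here, so this makes the row sum to zero.
        q[i][i] = -sum(q[i])
    # Rescale so the expected substitution rate at stationarity equals 1.
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[value / scale for value in row] for row in q]


def random_inputs(n=20, seed=0):
    """Placeholder inputs (assumption): random symmetric rates and frequencies."""
    rng = random.Random(seed)
    exch = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            exch[i][j] = exch[j][i] = rng.uniform(0.1, 5.0)
    raw = [rng.uniform(0.5, 1.5) for _ in range(n)]
    total = sum(raw)
    return exch, [x / total for x in raw]


def detailed_balance_gap(q, freqs):
    """Largest violation of freqs[i] * Q[i][j] == freqs[j] * Q[j][i]."""
    n = len(freqs)
    return max(
        abs(freqs[i] * q[i][j] - freqs[j] * q[j][i])
        for i in range(n)
        for j in range(n)
    )


exch, freqs = random_inputs()
Q = build_gtr_q(exch, freqs)
gap = detailed_balance_gap(Q, freqs)
```

Because the exchangeabilities are symmetric, detailed balance holds by construction, which is the property that defines time reversibility.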

nQMaker: estimating time non-reversible amino acid substitution models

DOI: 10.1101/2021.10.18.464754

2021

Author(s): Cuong Cao Dang, Bui Quang Minh, Hanon McShea, Joanna Masel, Jennifer Eleanor James, ...

Keyword(s): Amino Acid, Amino Acid Substitution, Phylogenetic Trees, Phylogenetic Analyses, Sequence Alignments, Biological Reality, Wide Range, Substitution Models, Likelihood Approach, Protein Datasets

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce a maximum likelihood approach nQMaker, an extension of the recently published QMaker method, that allows the estimation of time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.
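The difference between reversible and non-reversible models can be illustrated numerically. The sketch below is not nQMaker itself: it uses hypothetical function names and random placeholder rates, finds the stationary distribution by power iteration, and measures how far each matrix is from detailed balance. A matrix with unconstrained off-diagonal rates generally violates detailed balance, while the symmetric special case satisfies it.

```python
import random


def random_rate_matrix(n=20, seed=1, symmetric=False):
    """Random rate matrix with positive off-diagonal rates; rows sum to zero."""
    rng = random.Random(seed)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = rng.uniform(0.1, 1.0)
    if symmetric:
        # Averaging q[i][j] and q[j][i] yields a reversible special case.
        for i in range(n):
            for j in range(i + 1, n):
                q[i][j] = q[j][i] = 0.5 * (q[i][j] + q[j][i])
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is 0.0 before this assignment
    return q


def stationary(q, iters=2000):
    """Stationary distribution via power iteration on P = I + Q / lam."""
    n = len(q)
    lam = 1.1 * max(-q[i][i] for i in range(n))
    p = [
        [(1.0 if i == j else 0.0) + q[i][j] / lam for j in range(n)]
        for i in range(n)
    ]
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


def detailed_balance_gap(q, pi):
    """Largest violation of pi[i] * Q[i][j] == pi[j] * Q[j][i]."""
    n = len(pi)
    return max(
        abs(pi[i] * q[i][j] - pi[j] * q[j][i])
        for i in range(n)
        for j in range(n)
    )


q_nonrev = random_rate_matrix(symmetric=False)
q_sym = random_rate_matrix(symmetric=True)
pi_nonrev = stationary(q_nonrev)
pi_sym = stationary(q_sym)
gap_nonrev = detailed_balance_gap(q_nonrev, pi_nonrev)
gap_sym = detailed_balance_gap(q_sym, pi_sym)
```

Here `gap_sym` is zero up to floating-point error, whereas `gap_nonrev` is clearly positive. That asymmetry between forward and backward rates is what carries information about the direction of time, and hence about root placement.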

A gene based bacterial whole genome comparison toolkit

Revista de Informática Teórica e Aplicada, 2019, Vol 26 (1), pp. 36

DOI: 10.22456/2175-2745.84814

Author(s): Luciano Antonio Digiampietri, Vivian Mayumi Yamassaki Pereira, Geraldo José Santos-Júnior, Giovani Sousa-Leite, Priscilla Koch Wagner, ...

Keyword(s): Amino Acid, Genome Comparison, Local Alignment, Whole Genome, rRNA Processing, Sequence Alignments, Genomic Features, Whole Genomes, Study Case, Whole Genome Comparison

Most computational biology analyses rely on comparing genomic features. Nucleotide and amino acid sequence alignments are frequently used in gene function identification and genome comparison. Despite their widespread use, alignments have limitations in their analysis capabilities that need to be considered but are often overlooked by, or unknown to, many researchers. This paper presents a gene-based whole genome comparison toolkit that can be used not only as an alternative and more robust way to compare a set of whole genomes but also to understand the tradeoffs of using local sequence alignment in this kind of comparison. A case study was performed on fifteen whole genomes of the Xanthomonas genus. The results were compared with the 16S rRNA-processing protein RimM phylogeny, and some thresholds for the use of sequence alignments in this kind of analysis are discussed.

accuMUlate: A mutation caller designed for mutation accumulation experiments

DOI: 10.1101/182956

2017

Author(s): David J. Winter, Steven H. Wu, Abigail A. Howell, Ricardo B. R. Azevedo, Rebecca A. Zufall, ...

Keyword(s): Source Code, Mutation Accumulation, Molecular Spectra, Whole Genome, Biological Processes, Summary Statistics, Spontaneous Mutations, Heuristic Rules, Sequencing Technologies, Whole Genomes

Abstract. Motivation: Mutation accumulation (MA) is the most widely used method for directly studying the effects of mutation. Modern sequencing technologies have led to an increased interest in MA experiments. By sequencing whole genomes from MA lines, researchers can directly study the rate and molecular spectra of spontaneous mutations and use these results to understand how mutation contributes to biological processes. At present there is no software designed specifically for identifying mutations from MA lines. Studies that combine MA with whole genome sequencing use custom bioinformatic pipelines that implement heuristic rules to identify putative mutations. Results: Here we describe accuMUlate, a program that is designed to detect mutations from MA experiments. accuMUlate implements a probabilistic model that reflects the design of a typical MA experiment while being flexible enough to accommodate properties unique to any particular experiment. For each putative mutation identified from this model, accuMUlate calculates a set of summary statistics that can be used to filter sites that may be false positives. A companion tool, denominate, can be used to apply filtering rules based on these statistics to simulated mutations and thus identify the number of callable sites per sample. Availability: Source code and releases are available from https://github.com/dwinter/accuMUlate.

Bayesian analysis of amino acid substitution models

Philosophical Transactions of the Royal Society B Biological Sciences, 2008, Vol 363 (1512), pp. 3941-3953

DOI: 10.1098/rstb.2008.0175

Cited By ~ 38

Author(s): John P Huelsenbeck, Paul Joyce, Clemens Lakner, Fredrik Ronquist

Keyword(s): Amino Acid, Amino Acid Substitution, DNA Sequences, Dirichlet Process, Prior Probability, Nucleotide Substitution, Amino Acid Sequences, Dirichlet Process Prior, Substitution Models, Free Parameters

Models of amino acid substitution present challenges beyond those often faced with the analysis of DNA sequences. The alignments of amino acid sequences are often small, whereas the number of parameters to be estimated is potentially large when compared with the number of free parameters for nucleotide substitution models. Most approaches to the analysis of amino acid alignments have focused on the use of fixed amino acid models in which all of the potentially free parameters are fixed to values estimated from a large number of sequences. Often, these fixed amino acid models are specific to a gene or taxonomic group (e.g. the Mtmam model, which has parameters that are specific to mammalian mitochondrial gene sequences). Although the fixed amino acid models succeed in reducing the number of free parameters to be estimated—indeed, they reduce the number of free parameters from approximately 200 to 0—it is possible that none of the currently available fixed amino acid models is appropriate for a specific alignment. Here, we present four approaches to the analysis of amino acid sequences. First, we explore the use of a general time reversible model of amino acid substitution using a Dirichlet prior probability distribution on the 190 exchangeability parameters. Second, we then explore the behaviour of prior probability distributions that are ‘centred’ on the rates specified by the fixed amino acid model. Third, we consider a mixture of fixed amino acid models. Finally, we consider constraints on the exchangeability parameters as partitions, similar to how nucleotide substitution models are specified, and place a Dirichlet process prior model on all the possible partitioning schemes.
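The parameter counts in this abstract follow from simple combinatorics: 20 amino acids give 20 * 19 / 2 = 190 exchangeabilities, and together with 19 free frequencies that is roughly the "approximately 200" free parameters mentioned. The sketch below reproduces the count and draws from a flat Dirichlet prior and from one centred on a reference model. The reference values and the concentration are hypothetical placeholders, and the paper's exact prior parameterization is not given in the abstract.

```python
import random
from math import comb

N_STATES = 20
n_exch = comb(N_STATES, 2)      # 190 exchangeability parameters
n_freq_free = N_STATES - 1      # 19 free stationary frequencies
n_free_total = n_exch + n_freq_free


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)

# Flat prior over the 190 exchangeabilities (all alphas equal to 1).
flat_sample = sample_dirichlet([1.0] * n_exch, rng)

# Hypothetical stand-in for a fixed model's normalized exchangeabilities.
raw = [1.0 + (k % 5) for k in range(n_exch)]
raw_total = sum(raw)
reference = [r / raw_total for r in raw]

# A prior "centred" on the reference: alphas proportional to its rates.
# The concentration is an assumed value; larger values hug the reference.
concentration = 5000.0
centred_alphas = [concentration * r for r in reference]
centred_sample = sample_dirichlet(centred_alphas, rng)

# The Dirichlet mean is alphas / sum(alphas), which equals the reference.
alpha_total = sum(centred_alphas)
prior_mean = [a / alpha_total for a in centred_alphas]
```

Raising `concentration` tightens the prior around the fixed model, while lowering it toward a flat prior lets the alignment dominate the estimate.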
