MODELING AMINO ACID SUBSTITUTIONS FOR WHOLE GENOMES

2021 ◽  
Vol 37 (4) ◽  
pp. 351-363
Author(s):  
Le Sy Vinh

Modeling the amino acid substitution process is a core task in bioinformatics. Advanced sequencing technologies have generated huge datasets, including whole genomes from various species. Estimating amino acid substitution models from whole-genome datasets provides unprecedented opportunities to accurately investigate relationships among species. In this paper, we review state-of-the-art computational methods for estimating amino acid substitution models from large datasets. We also describe a comprehensive, practical pipeline for estimating amino acid models from whole-genome datasets. Finally, we apply amino acid substitution models to build phylogenomic trees from bird and plant genome datasets. We compare our newly reconstructed phylogenomic trees with published ones and discuss new findings.

2020 ◽  
Author(s):  
Huihui Chang ◽  
Yimeng Nie ◽  
Nan Zhang ◽  
Xue Zhang ◽  
Huimin Sun ◽  
...  

Background: Amino acid substitution models play an important role in inferring phylogenies from mitochondrial proteins. Although different amino acid substitution models have been proposed, only a few were estimated from mitochondrial protein sequences for specific taxa, such as the mtArt model for Arthropoda. The growing volume of mitochondrial genome data from a broad range of Orthoptera taxa provides an opportunity to estimate an Orthoptera-specific empirical mitochondrial amino acid model. Results: We sequenced complete mitochondrial genomes of 54 Orthoptera species and then estimated an amino acid substitution model (named mtOrt) by the maximum likelihood method, based on the 283 complete mitochondrial genomes currently available. The results indicate that there are obvious differences between mtOrt and existing models, and that the new model better fits the Orthoptera mitochondrial protein datasets. Moreover, the topologies of trees constructed using mtOrt and existing models frequently differ, so mtOrt affects tree topologies as well as likelihood. Comparisons between these topologies show that the new model outperforms the existing models in inferring phylogenies from Orthoptera mitochondrial protein data. Conclusions: The new mitochondrial amino acid substitution model of Orthoptera shows obvious differences from the existing models and outperforms them in inferring phylogenies from Orthoptera mitochondrial protein sequences.


Author(s):  
Bui Quang Minh ◽  
Cuong Cao Dang ◽  
Le Sy Vinh ◽  
Robert Lanfear

Abstract: Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection step for handling model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org), with support for multiple CPU cores, so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.


2021 ◽  
Author(s):  
Cuong Cao Dang ◽  
Bui Quang Minh ◽  
Hanon McShea ◽  
Joanna Masel ◽  
Jennifer Eleanor James ◽  
...  

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce nQMaker, a maximum likelihood approach that extends the recently published QMaker method and allows the estimation of time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.
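To make the time-reversibility assumption concrete, here is a minimal sketch (not part of nQMaker or IQ-TREE) that tests whether a rate matrix Q satisfies detailed balance, pi_i * q_ij = pi_j * q_ji, with respect to its stationary distribution pi. The function names, the power-iteration approach, and the two 3-state toy matrices are all illustrative assumptions; a real amino acid matrix would have 20 states.

```python
def stationary_distribution(q, iterations=2000):
    """Approximate the stationary distribution of rate matrix q (rows sum to 0).

    Assumption: q is irreducible. We iterate on P = I + Q/scale, which has
    the same stationary distribution as Q.
    """
    n = len(q)
    # Scale slightly above the largest exit rate so P keeps a positive diagonal.
    scale = 1.1 * max(-q[i][i] for i in range(n))
    p = [[(1.0 if i == j else 0.0) + q[i][j] / scale for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * p[i][j] for i in range(n)) for j in range(n)]
    return pi


def is_time_reversible(q, tol=1e-8):
    """Return True if pi_i * q_ij == pi_j * q_ji for every pair of states."""
    pi = stationary_distribution(q)
    n = len(q)
    return all(abs(pi[i] * q[i][j] - pi[j] * q[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


# Toy reversible matrix: q_ij = pi_j with pi = (0.5, 0.3, 0.2).
reversible_q = [[-0.5, 0.3, 0.2],
                [0.5, -0.7, 0.2],
                [0.5, 0.3, -0.8]]

# Toy non-reversible matrix: substitutions only flow 0 -> 1 -> 2 -> 0.
cyclic_q = [[-1.0, 1.0, 0.0],
            [0.0, -1.0, 1.0],
            [1.0, 0.0, -1.0]]

print(is_time_reversible(reversible_q))
print(is_time_reversible(cyclic_q))
```

The cyclic matrix has a uniform stationary distribution but a net flow around the cycle, which is the kind of directional signal a reversible model cannot represent.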


2019 ◽  
Vol 26 (1) ◽  
pp. 36
Author(s):  
Luciano Antonio Digiampietri ◽  
Vivian Mayumi Yamassaki Pereira ◽  
Geraldo José Santos-Júnior ◽  
Giovani Sousa-Leite ◽  
Priscilla Koch Wagner ◽  
...  

Most computational biology analyses are carried out by comparing genomic features. Nucleotide and amino acid sequence alignments are frequently used in gene function identification and genome comparison. Despite their widespread use, these alignments have analytical limitations that need to be considered but are often overlooked by, or unknown to, many researchers. This paper presents a gene-based whole-genome comparison toolkit that can be used not only as an alternative and more robust way to compare a set of whole genomes but also to understand the trade-offs of using local sequence alignment in this kind of comparison. A case study was performed on fifteen whole genomes of the Xanthomonas genus. The results were compared with the phylogeny of the 16S rRNA-processing protein RimM, and some thresholds for the use of sequence alignments in this kind of analysis are discussed.


2017 ◽  
Author(s):  
David J. Winter ◽  
Steven H. Wu ◽  
Abigail A. Howell ◽  
Ricardo B. R. Azevedo ◽  
Rebecca A. Zufall ◽  
...  

Motivation: Mutation accumulation (MA) is the most widely used method for directly studying the effects of mutation. Modern sequencing technologies have led to an increased interest in MA experiments. By sequencing whole genomes from MA lines, researchers can directly study the rate and molecular spectra of spontaneous mutations and use these results to understand how mutation contributes to biological processes. At present there is no software designed specifically for identifying mutations from MA lines. Studies that combine MA with whole genome sequencing use custom bioinformatic pipelines that implement heuristic rules to identify putative mutations. Results: Here we describe accuMUlate, a program that is designed to detect mutations from MA experiments. accuMUlate implements a probabilistic model that reflects the design of a typical MA experiment while being flexible enough to accommodate properties unique to any particular experiment. For each putative mutation identified from this model, accuMUlate calculates a set of summary statistics that can be used to filter sites that may be false positives. A companion tool, denominate, can be used to apply filtering rules based on these statistics to simulated mutations and thus identify the number of callable sites per sample. Availability: Source code and releases are available from https://github.com/dwinter/accuMUlate.


2008 ◽  
Vol 363 (1512) ◽  
pp. 3941-3953 ◽  
Author(s):  
John P Huelsenbeck ◽  
Paul Joyce ◽  
Clemens Lakner ◽  
Fredrik Ronquist

Models of amino acid substitution present challenges beyond those often faced with the analysis of DNA sequences. The alignments of amino acid sequences are often small, whereas the number of parameters to be estimated is potentially large when compared with the number of free parameters for nucleotide substitution models. Most approaches to the analysis of amino acid alignments have focused on the use of fixed amino acid models in which all of the potentially free parameters are fixed to values estimated from a large number of sequences. Often, these fixed amino acid models are specific to a gene or taxonomic group (e.g. the Mtmam model, which has parameters that are specific to mammalian mitochondrial gene sequences). Although the fixed amino acid models succeed in reducing the number of free parameters to be estimated—indeed, they reduce the number of free parameters from approximately 200 to 0—it is possible that none of the currently available fixed amino acid models is appropriate for a specific alignment. Here, we present four approaches to the analysis of amino acid sequences. First, we explore the use of a general time reversible model of amino acid substitution using a Dirichlet prior probability distribution on the 190 exchangeability parameters. Second, we then explore the behaviour of prior probability distributions that are ‘centred’ on the rates specified by the fixed amino acid model. Third, we consider a mixture of fixed amino acid models. Finally, we consider constraints on the exchangeability parameters as partitions, similar to how nucleotide substitution models are specified, and place a Dirichlet process prior model on all the possible partitioning schemes.
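As a rough illustration of the first approach described above, the sketch below draws the 190 exchangeability parameters from a flat Dirichlet distribution and assembles a general time-reversible rate matrix with q_ij = r_ij * pi_j. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the flat Dirichlet(1, ..., 1) parameters, the random stationary frequencies, and the scaling to one expected substitution per unit time are all choices made here for illustration.

```python
import random
from itertools import combinations

N_STATES = 20  # number of amino acids


def dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def build_gtr_q(exchangeabilities, frequencies):
    """Build a GTR rate matrix with q_ij = r_ij * pi_j for i != j.

    The matrix is rescaled so that the expected substitution rate at
    stationarity, -sum_i pi_i * q_ii, equals 1.
    """
    n = len(frequencies)
    pairs = list(combinations(range(n), 2))
    if len(exchangeabilities) != len(pairs):
        raise ValueError("expected one exchangeability per unordered state pair")
    q = [[0.0] * n for _ in range(n)]
    for (i, j), r in zip(pairs, exchangeabilities):
        q[i][j] = r * frequencies[j]
        q[j][i] = r * frequencies[i]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0 here, so this is the off-diagonal sum
    rate = -sum(frequencies[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


rng = random.Random(0)                          # fixed seed, for repeatability only
n_pairs = N_STATES * (N_STATES - 1) // 2        # 190 unordered amino acid pairs
r = dirichlet([1.0] * n_pairs, rng)             # assumed flat prior on exchangeabilities
pi = dirichlet([1.0] * N_STATES, rng)           # assumed random stationary frequencies
Q = build_gtr_q(r, pi)
print(n_pairs, len(Q), len(Q[0]))
```

The count of 190 exchangeabilities plus 19 free frequencies (the 20 frequencies sum to 1) is consistent with the "approximately 200" free parameters mentioned in the abstract.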

