nQMaker: estimating time non-reversible amino acid substitution models

Amino acid substitution models are a key component in phylogenetic analyses of protein sequences. All amino acid models available to date are time-reversible, an assumption designed for computational convenience but not for biological reality. Another significant downside to time-reversible models is that they do not allow inference of rooted trees without outgroups. In this paper, we introduce nQMaker, a maximum likelihood approach that extends the recently published QMaker method to estimate time non-reversible amino acid substitution models and rooted phylogenetic trees from a set of protein sequence alignments. We show that the non-reversible models estimated with nQMaker are a much better fit to empirical alignments than pre-existing reversible models, across a wide range of datasets including mammals, birds, plants, fungi, and other taxa, and that the improvements in model fit scale with the size of the dataset. Notably, for the recently published plant and bird trees, these non-reversible models correctly recovered the commonly known root placements with very high statistical support without the need to use an outgroup. We provide nQMaker as an easy-to-use feature in the IQ-TREE software (http://www.iqtree.org), allowing users to estimate non-reversible models and rooted phylogenies from their own protein datasets.
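
The distinction between reversible and non-reversible models can be made concrete with the detailed balance condition pi_i * q_ij = pi_j * q_ji. The following is a minimal sketch, not the nQMaker implementation: the function names and the 3-state toy matrices are illustrative assumptions, and the parameter counts are the standard ones for a rate matrix normalised to one expected substitution per unit time.

```python
from itertools import combinations


def stationary_distribution(Q, iterations=10_000, step=0.01):
    """Approximate pi with pi Q = 0 by iterating pi <- pi (I + step * Q)."""
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [pi[j] + step * sum(pi[i] * Q[i][j] for i in range(n))
              for j in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


def is_time_reversible(Q, tol=1e-6):
    """Check detailed balance pi_i q_ij == pi_j q_ji for every state pair."""
    pi = stationary_distribution(Q)
    return all(abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < tol
               for i, j in combinations(range(len(Q)), 2))


def free_parameters(n_states=20):
    """Free parameters of a normalised reversible vs. non-reversible Q."""
    reversible = n_states * (n_states - 1) // 2 - 1 + (n_states - 1)
    non_reversible = n_states * (n_states - 1) - 1
    return reversible, non_reversible


# Toy 3-state matrices (hypothetical values, rows sum to zero).
reversible_q = [[-0.5, 0.3, 0.2],
                [0.5, -0.7, 0.2],
                [0.5, 0.3, -0.8]]
cyclic_q = [[-1.0, 1.0, 0.0],   # state 0 -> 1 -> 2 -> 0 only,
            [0.0, -1.0, 1.0],   # so the process has a direction in time
            [1.0, 0.0, -1.0]]

print(is_time_reversible(reversible_q))  # detailed balance holds
print(is_time_reversible(cyclic_q))      # detailed balance is violated
print(free_parameters(20))
```

Because a non-reversible process behaves differently forwards and backwards in time, the likelihood depends on where the root is placed, which is what makes rooting without an outgroup possible in principle.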

QMaker: Fast and accurate method to estimate empirical models of protein evolution

10.1101/2020.02.20.958819 ◽

2020 ◽

Cited By ~ 1

Author(s):

Bui Quang Minh ◽

Cuong Cao Dang ◽

Le Sy Vinh ◽

Robert Lanfear

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Search Algorithm ◽

Phylogenetic Analyses ◽

Accurate Method ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Substitution Models ◽

Protein Alignments

Abstract Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this paper, we propose QMaker, a new ML method to estimate a general time-reversible Q matrix from a large protein dataset consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies.

QMaker: Fast and Accurate Method to Estimate Empirical Models of Protein Evolution

Systematic Biology ◽

10.1093/sysbio/syab010 ◽

2021 ◽

Author(s):

Bui Quang Minh ◽

Cuong Cao Dang ◽

Le Sy Vinh ◽

Robert Lanfear

Keyword(s):

Amino Acid ◽

Maximum Likelihood ◽

Amino Acid Substitution ◽

Search Algorithm ◽

Phylogenetic Analyses ◽

Accurate Method ◽

Sequence Alignments ◽

Multiple Sequence ◽

Data Set ◽

Substitution Models

Abstract Amino acid substitution models play a crucial role in phylogenetic analyses. Maximum likelihood (ML) methods have been proposed to estimate amino acid substitution models; however, they are typically complicated and slow. In this article, we propose QMaker, a new ML method to estimate a general time-reversible $Q$ matrix from a large protein data set consisting of multiple sequence alignments. QMaker combines an efficient ML tree search algorithm, a model selection for handling the model heterogeneity among alignments, and the consideration of rate mixture models among sites. We provide QMaker as a user-friendly function in the IQ-TREE software package (http://www.iqtree.org) supporting the use of multiple CPU cores so that biologists can easily estimate amino acid substitution models from their own protein alignments. We used QMaker to estimate new empirical general amino acid substitution models from the current Pfam database as well as five clade-specific models for mammals, birds, insects, yeasts, and plants. Our results show that the new models considerably improve the fit between model and data and in some cases influence the inference of phylogenetic tree topologies. [Amino acid replacement matrices; amino acid substitution models; maximum likelihood estimation; phylogenetic inferences.]
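
For readers unfamiliar with the term, a general time-reversible Q matrix is usually parameterised as q_ij = r_ij * pi_j, with symmetric exchangeabilities r_ij and stationary frequencies pi_j. The sketch below only illustrates that parameterisation on a 4-state toy example; it does not estimate anything, and the function name and all numbers are hypothetical rather than taken from QMaker.

```python
def build_reversible_q(exchangeabilities, frequencies):
    """Assemble a normalised reversible rate matrix from r_ij and pi_j."""
    n = len(frequencies)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                Q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # Diagonal makes each row sum to zero (Q[i][i] is still 0.0 here).
        Q[i][i] = -sum(Q[i])
    # Rescale so the expected number of substitutions per unit time is 1.
    rate = -sum(frequencies[i] * Q[i][i] for i in range(n))
    return [[q / rate for q in row] for row in Q]


# Hypothetical symmetric exchangeabilities and frequencies for 4 states.
toy_r = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]
toy_freqs = [0.1, 0.2, 0.3, 0.4]

toy_q = build_reversible_q(toy_r, toy_freqs)
for row in toy_q:
    print(["%.3f" % q for q in row])
```

For amino acids the same construction uses 20 states, so the matrix is defined by 190 exchangeabilities and 20 frequencies.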

Hubsm: A Novel Amino Acid Substitution Matrix for Comparing Hub Proteins

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i8.53 ◽

2017 ◽

Vol 7 (8) ◽

pp. 212

Author(s):

Renganayaki G. ◽

Achuthsankar S. Nair

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Low Complexity ◽

Database Search ◽

Substitution Matrix ◽

Compositional Bias ◽

Sequence Alignments ◽

Amino Acid Substitution Matrix ◽

Alignment Algorithms ◽

Hub Proteins

Sequence alignment algorithms and database search methods use BLOSUM and PAM substitution matrices constructed from general proteins. These de facto standard matrices are not optimal for accurately aligning proteins with markedly different amino acid compositional bias. In this work, a new amino acid substitution matrix, Hubsm, is calculated for the disordered and low-complexity-rich regions of Hub proteins, based on residue characteristics. The amino acid background frequencies and substitution scores obtained from Hubsm reveal residue substitution patterns that differ from those of commonly used scoring matrices. When comparing Hub protein sequences to detect homologs, the Hubsm matrix yields better results than the PAM and BLOSUM matrices. The Hubsm matrix may therefore be well suited to database searches and to the construction of more accurate sequence alignments of Hub proteins.
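
The abstract does not give the formula used for Hubsm. As background only, the sketch below shows the generic log-odds principle behind BLOSUM-style scoring matrices: a pair scores positively when it is observed more often than its background frequencies predict. The function name and the toy pair counts are assumptions for illustration.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Score each unordered residue pair as log2(observed / expected)."""
    pair_counts = Counter(tuple(sorted(pair)) for pair in aligned_pairs)
    total_pairs = sum(pair_counts.values())

    # Background frequency of each residue over all aligned positions.
    residue_counts = Counter()
    for (a, b), count in pair_counts.items():
        residue_counts[a] += count
        residue_counts[b] += count
    total_residues = 2 * total_pairs
    background = {r: c / total_residues for r, c in residue_counts.items()}

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / total_pairs
        # Unordered mismatches can arise in two orders, hence the factor 2.
        expected = (background[a] ** 2 if a == b
                    else 2 * background[a] * background[b])
        scores[(a, b)] = math.log2(observed / expected)
    return scores


# Hypothetical aligned residue pairs (not real Hub protein data).
toy_pairs = [("A", "A")] * 6 + [("A", "S")] * 2 + [("S", "S")] * 2
toy_scores = log_odds_scores(toy_pairs)
for pair, score in sorted(toy_scores.items()):
    print(pair, round(score, 2))
```

A matrix built from a compositionally biased set of proteins changes both the observed pair frequencies and the background frequencies, which is why scores can differ from those of general-purpose matrices.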

A test statistic to quantify treelikeness in phylogenetics

10.1101/2021.02.16.431544 ◽

2021 ◽

Author(s):

Caitlin Cherryh ◽

Bui Quang Minh ◽

Rob Lanfear

Keyword(s):

Evolutionary History ◽

Incomplete Lineage Sorting ◽

Phylogenetic Analyses ◽

Phylogenetic Network ◽

Parametric Bootstrap ◽

Lineage Sorting ◽

Test Statistic ◽

Sequence Alignments ◽

Wide Range ◽

History Of

Abstract Most phylogenetic analyses assume that the evolutionary history of an alignment (either that of a single locus, or of multiple concatenated loci) can be described by a single bifurcating tree, the so-called treelikeness assumption. Treelikeness can be violated by biological events such as recombination, introgression, or incomplete lineage sorting, and by systematic errors in phylogenetic analyses. The incorrect assumption of treelikeness may then mislead phylogenetic inferences. To quantify and test for treelikeness in alignments, we develop a test statistic which we call the tree proportion. This statistic quantifies the proportion of the edge weights in a phylogenetic network that are represented in a bifurcating phylogenetic tree of the same alignment. We extend this statistic to a statistical test of treelikeness using a parametric bootstrap. We use extensive simulations to compare tree proportion to a range of related approaches. We show that tree proportion successfully identifies non-treelikeness in a wide range of simulation scenarios, and discuss its strengths and weaknesses compared to other approaches. The power of the tree-proportion test to reject non-treelike alignments can be lower than that of some other approaches, but these approaches tend to be limited in their scope and/or the ease with which they can be interpreted. Our recommendation is to test treelikeness of sequence alignments with both tree proportion and mosaic methods such as 3Seq. The scripts necessary to replicate this study are available at https://github.com/caitlinch/treelikeness
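
The definition of the tree proportion given above can be sketched as a ratio of split weights. This is a simplified illustration, assuming that the network is available as weighted splits and that the splits of the tree are already known; how the authors obtain the network and the tree is not described in the abstract, and the helper names and weights below are hypothetical.

```python
def make_split(side, taxa):
    """Represent a split as an unordered pair of complementary taxon sets."""
    side = frozenset(side)
    return frozenset({side, frozenset(taxa) - side})


def tree_proportion(network_splits, tree_splits):
    """Share of total network split weight carried by splits in the tree."""
    total = sum(network_splits.values())
    in_tree = sum(weight for split, weight in network_splits.items()
                  if split in tree_splits)
    return in_tree / total


TAXA = "ABCD"

# Hypothetical network: four trivial splits plus two conflicting splits.
network = {make_split(t, TAXA): 0.1 for t in TAXA}
network[make_split("AB", TAXA)] = 0.3
network[make_split("AC", TAXA)] = 0.1  # conflicts with AB|CD

# A bifurcating tree on four taxa holds the trivial splits and AB|CD.
tree = {make_split(t, TAXA) for t in TAXA} | {make_split("AB", TAXA)}

proportion = tree_proportion(network, tree)
print(round(proportion, 3))
```

A value of 1 would mean every split in the network fits on the tree; conflicting splits such as AC|BD pull the value below 1.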

PROGRESSIVE MULTIPLE ALIGNMENT USING SEQUENCE TRIPLET OPTIMIZATIONS AND THREE-RESIDUE EXCHANGE COSTS

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720004000831 ◽

2004 ◽

Vol 02 (04) ◽

pp. 719-745 ◽

Cited By ~ 8

Author(s):

ARUN SIDDHARTH KONAGURTHU ◽

JAMES WHISSTOCK ◽

PETER J. STUCKEY

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Multiple Alignment ◽

Substitution Matrix ◽

Practical Approach ◽

Sequence Alignments ◽

Optimal Sequence ◽

Amino Acid Substitution Matrix ◽

Multiple Alignments ◽

First Time

In this paper we demonstrate a practical approach to constructing progressive multiple alignments using sequence triplet optimizations rather than a conventional pairwise approach. Using the sequence triplet alignments progressively provides scope for the synthesis of a three-residue exchange amino acid substitution matrix. We develop such a 20×20×20 matrix for the first time and demonstrate how its use in optimal sequence triplet alignments increases the sensitivity of building multiple alignments. Various comparisons were made between alignments generated using the progressive triplet methods and the conventional progressive pairwise procedure. The assessment of these data reveals that, in general, the triplet-based approaches generate more accurate sequence alignments than the traditional pairwise-based procedures, especially between more divergent sets of sequences.

Estimating Amino Acid Substitution Models: A Comparison of Dayhoff's Estimator, the Resolvent Approach and a Maximum Likelihood Method

Molecular Biology and Evolution ◽

10.1093/oxfordjournals.molbev.a003985 ◽

2002 ◽

Vol 19 (1) ◽

pp. 8-13 ◽

Cited By ~ 89

Author(s):

Tobias Müller ◽

Rainer Spang ◽

Martin Vingron

Keyword(s):

Amino Acid ◽

Maximum Likelihood ◽

Amino Acid Substitution ◽

Maximum Likelihood Method ◽

Likelihood Method ◽

Substitution Models ◽

Resolvent Approach

Broad-spectrum detection of papillomaviruses in bovine teat papillomas and healthy teat skin

Journal of General Virology ◽

10.1099/vir.0.80086-0 ◽

2004 ◽

Vol 85 (8) ◽

pp. 2191-2197 ◽

Cited By ~ 77

Author(s):

Tomoko Ogawa ◽

Yoshimi Tomita ◽

Mineyuki Okada ◽

Kuniko Shinozaki ◽

Hiroko Kubonoya ◽

...

Keyword(s):

Amino Acid ◽

Amino Acid Sequence ◽

Broad Spectrum ◽

Phylogenetic Analyses ◽

Genomic Diversity ◽

Bovine Papillomavirus ◽

Subclinical Infection ◽

Healthy Skin ◽

Sequence Alignments ◽

Spectrum Detection

To investigate the prevalence of bovine papillomavirus (BPV) in bovine papilloma and healthy skin, DNA extracted from teat papillomas and healthy teat skin swabs was analysed by PCR using the primer pairs FAP59/FAP64 and MY09/MY11. Papillomavirus (PV) DNA was detected in all 15 papilloma specimens using FAP59/FAP64 and in 8 of the 15 papilloma specimens using MY09/MY11. In swab samples, 21 and 8 of the 122 samples were PV DNA positive using FAP59/FAP64 and MY09/MY11, respectively. Four BPV types (BPV-1, -3, -5 and -6), two previously identified putative BPV types (BAA1 and -5) and 11 putative new PV types (designated BAPV1 to -10 and BAPV11MY) were found in the 39 PV DNA-positive samples. Amino acid sequence alignments of the putative new PV types with reported BPVs and phylogenetic analyses of the putative new PV types with human and animal PV types showed that BAPV1 to -10 and BAPV11MY are putative new BPV types. These results also showed the genomic diversity and extent of subclinical infection of BPV.

MtOrt: An empirical mitochondrial amino acid substitution model for evolutionary studies of Orthoptera insects

10.21203/rs.2.20989/v2 ◽

2020 ◽

Author(s):

Huihui Chang ◽

Yimeng Nie ◽

Nan Zhang ◽

Xue Zhang ◽

Huimin Sun ◽

...

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Mitochondrial Protein ◽

Protein Sequences ◽

Mitochondrial Genomes ◽

Substitution Model ◽

New Model ◽

Substitution Models ◽

Amino Acid Substitution Model ◽

Complete Mitochondrial Genomes

Abstract Background: Amino acid substitution models play an important role in inferring phylogenies from mitochondrial proteins. Although different amino acid substitution models have been proposed, only a few were estimated from mitochondrial protein sequences for specific taxa, such as the mtArt model for Arthropoda. The increasing availability of mitochondrial genome data from a broad range of Orthoptera taxa provides an opportunity to estimate an Orthoptera-specific empirical mitochondrial amino acid model. Results: We sequenced complete mitochondrial genomes of 54 Orthoptera species, and then estimated an amino acid substitution model (named mtOrt) by the maximum likelihood method based on the 283 complete mitochondrial genomes currently available. The results indicated that there are obvious differences between mtOrt and the existing models, and that the new model fits the Orthoptera mitochondrial protein datasets better. Moreover, topologies of trees constructed using mtOrt and existing models are frequently different. MtOrt therefore affects both likelihood values and tree topologies. Comparisons between the topologies of trees constructed using mtOrt and existing models show that the new model outperforms the existing models in inferring phylogenies from Orthoptera mitochondrial protein data. Conclusions: The new mitochondrial amino acid substitution model of Orthoptera shows obvious differences from the existing models, and outperforms the existing models in inferring phylogenies from Orthoptera mitochondrial protein sequences.

Phylogenetic Analyses of Sites in Different Protein Structural Environments Result in Distinct Placements of the Metazoan Root

Biology ◽

10.3390/biology9040064 ◽

2020 ◽

Vol 9 (4) ◽

pp. 64 ◽

Cited By ~ 6

Author(s):

Akanksha Pandey ◽

Edward L. Braun

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Solvent Accessibility ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Sister Group ◽

Striking Difference ◽

Relative Solvent Accessibility ◽

Protein Datasets ◽

The Impact

Phylogenomics, the use of large datasets to examine phylogeny, has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life; this could reflect, at least in part, the poor fit of the models used to analyze heterogeneous datasets. Some of the heterogeneity may reflect the different patterns of selection on proteins based on their structures. To test that hypothesis, we developed a pipeline to divide phylogenomic protein datasets into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had distinct signals for the topology of the deepest branches in the metazoan tree. We focused on a dataset that appeared to have a mixture of signals and we found that the most striking difference in phylogenetic signal reflected relative solvent accessibility. Analyses of exposed sites (residues located on the surface of proteins) yielded a tree that placed ctenophores sister to all other animals, whereas sites buried inside proteins yielded a tree with a sponge+ctenophore clade. These differences in phylogenetic signal were not ameliorated when we conducted analyses using a set of maximum-likelihood profile mixture models. These models are very similar to the Bayesian CAT model, which has been used in many analyses of deep metazoan phylogeny. In contrast, analyses conducted after recoding amino acids to limit the impact of deviations from compositional stationarity increased the congruence in the estimates of phylogeny for exposed and buried sites; after recoding, the amino acid trees estimated using the exposed and buried sites both supported placement of ctenophores sister to all other animals.
Although the central conclusion of our analyses is that sites in different structural environments yield distinct trees when analyzed using models of protein evolution, our amino acid recoding analyses also have implications for metazoan evolution. Specifically, our results add to the evidence that ctenophores are the sister group of all other animals and they further suggest that the placozoa+cnidaria clade found in some other studies deserves more attention. Taken as a whole, these results provide striking evidence that it is necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.
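
Amino acid recoding replaces the 20 residues with a smaller alphabet of similar residues, which reduces the influence of compositional differences among taxa. The abstract does not name the scheme used, so the sketch below assumes the widely used Dayhoff 6-state grouping purely for illustration; the function name and the example sequence are hypothetical.

```python
# Dayhoff 6-state groups (an assumed scheme, not stated in the abstract).
DAYHOFF6_GROUPS = ["AGPST", "DENQ", "HKR", "ILMV", "FWY", "C"]

# Map each amino acid to the index of its group, written as a digit.
RECODE = {residue: str(index)
          for index, group in enumerate(DAYHOFF6_GROUPS)
          for residue in group}


def recode(sequence):
    """Recode a protein sequence; gaps and unknown symbols pass through."""
    return "".join(RECODE.get(residue, residue) for residue in sequence.upper())


recoded = recode("MKT-AC")
print(recoded)
```

Applying the same mapping to every sequence in an alignment yields a 6-state alignment that can then be analysed with a 6-state substitution model.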

The Triple Amino Acid Substitution TAP-IVS in the EPSPS Gene Confers High Glyphosate Resistance to the Superweed Amaranthus hybridus

International Journal of Molecular Sciences ◽

10.3390/ijms20102396 ◽

2019 ◽

Vol 20 (10) ◽

pp. 2396 ◽

Cited By ~ 15

Author(s):

Maria J. García ◽

Candelario Palma-Bautista ◽

Antonia M. Rojano-Delgado ◽

Enzo Bracamonte ◽

João Portugal ◽

...

Keyword(s):

Amino Acid ◽

Amino Acid Substitution ◽

Glyphosate Resistance ◽

Wild Type ◽

Triple Mutant ◽

Resistance Level ◽

Amaranthus Hybridus ◽

Susceptible Population ◽

Epsps Gene ◽

Wide Range

The introduction of glyphosate-resistant (GR) crops revolutionized weed management; however, the improper use of this technology has selected for a wide range of weeds resistant to glyphosate, referred to as superweeds. We characterized the high glyphosate resistance level of an Amaranthus hybridus population (GRH), a superweed collected in a GR-soybean field from Cordoba, Argentina, as well as the resistance mechanisms that govern it in comparison to a susceptible population (GSH). The GRH population was 100.6 times more resistant than the GSH population. Reduced absorption and metabolism of glyphosate, as well as gene duplication of 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) or its overexpression, did not contribute to this resistance. However, GSH plants translocated at least 10% more 14C-glyphosate to the rest of the plant and roots than GRH plants at 9 h after treatment. In addition, a novel triple amino acid substitution from TAP (wild type, GSH) to IVS (triple mutant, GRH) was identified in the EPSPS gene of the GRH. The nucleotide substitutions consisted of ATA102, GTC103 and TCA106 instead of ACA102, GCG103, and CCA106, respectively. The hydrogen bond distances between Gly-101 and Arg-105 positions increased from 2.89 Å (wild type) to 2.93 Å (triple-mutant) according to the EPSPS structural modeling. These results indicate that the high level of glyphosate resistance of the GRH A. hybridus population was mainly governed by the triple mutation TAP-IVS found in the EPSPS target site, but that impaired translocation of the herbicide also contributed to this resistance.
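
The codon changes listed above can be checked against the standard genetic code to confirm that they spell out the TAP to IVS substitution. The sketch below includes only the six codons mentioned in the abstract; the variable and function names are illustrative.

```python
# Subset of the standard genetic code covering the codons in the abstract.
CODON_TO_AA = {
    "ACA": "T", "GCG": "A", "CCA": "P",  # wild type (GSH)
    "ATA": "I", "GTC": "V", "TCA": "S",  # triple mutant (GRH)
}

WILD_TYPE = {102: "ACA", 103: "GCG", 106: "CCA"}
MUTANT = {102: "ATA", 103: "GTC", 106: "TCA"}


def describe_substitutions(wild_type, mutant):
    """Return labels such as 'T102I' for each position whose residue changes."""
    labels = []
    for position in sorted(wild_type):
        before = CODON_TO_AA[wild_type[position]]
        after = CODON_TO_AA[mutant[position]]
        if before != after:
            labels.append(f"{before}{position}{after}")
    return labels


substitutions = describe_substitutions(WILD_TYPE, MUTANT)
print(substitutions)
```

Reading the residues in position order gives T, A, P for the wild type and I, V, S for the mutant, matching the TAP-IVS name used in the abstract.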