Statistical Comparison of Nucleotide, Amino Acid, and Codon Substitution Models for Evolutionary Analysis of Protein-Coding Sequences

Comparison of relative fixation rates of synonymous (silent) and nonsynonymous (amino acid-altering) mutations provides a means for understanding the mechanisms of molecular sequence evolution. The nonsynonymous/synonymous rate ratio (ω = dN/dS) is an important indicator of selective pressure at the protein level, with ω = 1 meaning neutral mutations, ω < 1 purifying selection, and ω > 1 diversifying positive selection. Amino acid sites in a protein are expected to be under different selective pressures and have different underlying ω ratios. We develop models that account for heterogeneous ω ratios among amino acid sites and apply them to phylogenetic analyses of protein-coding DNA sequences. These models are useful for testing for adaptive molecular evolution and identifying amino acid sites under diversifying selection. Ten data sets of genes from nuclear, mitochondrial, and viral genomes are analyzed to estimate the distributions of ω among sites. In all data sets analyzed, the selective pressure indicated by the ω ratio is found to be highly heterogeneous among sites. Previously unsuspected Darwinian selection is detected in several genes in which the average ω ratio across sites is <1, but in which some sites are clearly under diversifying selection with ω > 1. Genes undergoing positive selection include the β-globin gene from vertebrates, mitochondrial protein-coding genes from hominoids, the hemagglutinin (HA) gene from human influenza virus A, and HIV-1 env, vif, and pol genes. Tests for the presence of positively selected sites and their subsequent identification appear quite robust to the specific distributional form assumed for ω and can be achieved using any of several models we implement. However, we encountered difficulties in estimating the precise distribution of ω among sites from real data sets.
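The observation that a gene-wide average ω below 1 can coexist with a class of sites at ω > 1 can be illustrated with a small discrete mixture. The sketch below is only an illustration: the `SiteClass` type, the helper names, and the proportions and ω values are hypothetical, not estimates from any of the data sets mentioned in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteClass:
    """One class of amino acid sites (hypothetical illustration)."""
    proportion: float  # fraction of sites in this class
    omega: float       # dN/dS ratio shared by sites in this class


def classify_omega(omega: float, tol: float = 1e-9) -> str:
    """Map an omega value to the selection regime named in the abstract."""
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "purifying" if omega < 1.0 else "diversifying"


def mean_omega(classes: list[SiteClass]) -> float:
    """Proportion-weighted average omega across site classes."""
    total = sum(c.proportion for c in classes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("class proportions must sum to 1")
    return sum(c.proportion * c.omega for c in classes)


# Hypothetical three-class mixture: mostly conserved sites, some neutral
# sites, and a small class under diversifying selection.
classes = [
    SiteClass(proportion=0.80, omega=0.05),
    SiteClass(proportion=0.15, omega=1.00),
    SiteClass(proportion=0.05, omega=3.00),
]

average = mean_omega(classes)
regimes = [classify_omega(c.omega) for c in classes]
has_diversifying_class = any(r == "diversifying" for r in regimes)
```

With these made-up numbers the average ω is about 0.34, so a single gene-wide estimate would suggest purifying selection even though 5% of the sites sit at ω = 3. This is the situation that motivates letting ω vary among sites.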

Detection of Protein Coding Sequences Using a Mixture Model for Local Protein Amino Acid Sequence

Journal of Computational Biology, 2000, Vol 7 (1-2), pp. 317-327. DOI: 10.1089/10665270050081559. Cited by ~2.

Author(s): Edward C. Thayer, Chris Bystroff, David Baker

Keyword(s): Amino Acid, Amino Acid Sequence, Mixture Model, Protein Amino Acid, Protein Coding, Coding Sequences, Protein Amino Acid Sequence, Local Protein

Investigating Protein-Coding Sequence Evolution with Probabilistic Codon Substitution Models

Molecular Biology and Evolution, 2008, Vol 26 (2), pp. 255-271. DOI: 10.1093/molbev/msn232. Cited by ~97.

Author(s): M. Anisimova, C. Kosiol

Keyword(s): Sequence Evolution, Protein Coding, Coding Sequence, Codon Substitution, Substitution Models

CAUSA 2.0: accurate and consistent evolutionary analysis of proteins using codon and amino acid unified sequence alignments

2015. DOI: 10.7287/peerj.preprints.1214v1.

Author(s): Xiaolong Wang, Chao Yang

Keyword(s): Amino Acid, Sequence Alignment, Novel Mutation, Genetic Diseases, Amino Acid Level, Evolutionary Analysis, Sequence Alignments, Multiple Sequence, Coding Sequences, Functional Changes

Multiple sequence alignment (MSA) is widely used to reveal structural and functional changes leading to genetic differences among species, and to reconstruct evolutionary histories of related genes, proteins and genomes. Traditionally, proteins and their coding sequences (CDSs) are aligned and analyzed separately, but drastically different conclusions are often drawn from the same data set. Here we present a new alignment strategy, Codon and Amino Acid Unified Sequence Alignment (CAUSA) 2.0, which aligns proteins and their coding sequences simultaneously. CAUSA 2.0 efficiently optimizes the alignment of CDSs at both the codon and amino acid levels. Theoretical analysis showed that CAUSA 2.0 enhances the entropy information content of MSA. Empirical data analysis demonstrated that CAUSA 2.0 is more accurate and consistent than nucleotide, protein or codon level alignments. CAUSA 2.0 locates in-frame indels more accurately, makes the alignment of coding sequences biologically more significant, and reveals several novel mutation mechanisms that relate to some genetic diseases. CAUSA 2.0 is available at www.DNAPlusPro.com.
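For readers unfamiliar with what a codon-level alignment looks like, the sketch below threads a coding sequence onto an already aligned protein so that each gap column becomes a three-nucleotide gap. This is not the CAUSA 2.0 algorithm, which the abstract describes as aligning proteins and CDSs simultaneously; it is the simpler protein-guided threading approach, shown only to make the codon/amino-acid correspondence concrete. The function name and the toy sequences are hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Project a gapped protein alignment row onto its coding sequence.

    Each residue consumes one codon from `cds`; each protein gap becomes
    three nucleotide gap characters, so indels stay in frame.
    Assumes `cds` has no stop codon and matches the protein residue count.
    """
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for ch in aligned_protein if ch != gap)
    if n_residues != len(codons):
        raise ValueError("protein residues and codons do not match in number")
    codon_iter = iter(codons)
    return "".join(gap * 3 if ch == gap else next(codon_iter)
                   for ch in aligned_protein)


# Toy example: one protein row with a single gap column.
aligned_row = "MK-LV"
cds_row = "ATGAAACTGGTT"  # ATG AAA CTG GTT
codon_aligned = thread_codons(aligned_row, cds_row)
```

Here `codon_aligned` is `ATGAAA---CTGGTT`: the single protein gap maps to one whole missing codon, so the indel stays in frame.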

Standard codon substitution models overestimate purifying selection for non-stationary data

2016. DOI: 10.7287/peerj.preprints.2218v1.

Author(s): Benjamin D Kaehler, Von Bing Yap, Gavin A Huttley

Keyword(s): Natural Selection, De Novo, Purifying Selection, Neutral Evolution, Protein Coding, Synonymous Substitutions, New Model, Sequence Composition, Codon Substitution, Substitution Models

Estimation of natural selection on protein-coding sequences is a key comparative genomics approach for de novo prediction of lineage specific adaptations. Selective pressure is measured on a per-gene basis by comparing the rate of non-synonymous substitutions to the rate of neutral evolution, typically assumed to be the rate of synonymous substitutions. All published codon substitution models have been time-reversible and thus assume that sequence composition does not change over time. We previously demonstrated that if time-reversible DNA substitution models are applied blindly in the presence of changing sequence composition, the number of substitutions is systematically biased towards overestimation. We extend these findings to the case of codon substitution models and further demonstrate that the ratio of non-synonymous to synonymous rates of substitution tends to be underestimated over three data sets of insects, mammals, and vertebrates. Our basis for comparison is a non-stationary codon substitution model that allows sequence composition to change. Model selection and model fit results demonstrate that our new model tends to fit the data better. Direct measurement of non-stationarity shows that bias in estimates of natural selection and genetic distance increases with the degree of violation of the stationarity assumption. Additionally, inferences drawn under time-reversible models are systematically affected by compositional divergence. As genomic sequences accumulate at an accelerating rate, the importance of accurate de novo estimation of natural selection increases. Our results establish that our new model provides a more robust perspective on this fundamental quantity.

Comparison of Phylogenetic Relationships Constructed by Using Protein-Coding Sequences, Intergenic Sequences and Amino Acid Sequences

ACTA BIOPHYSICA SINICA, 2012, Vol 28 (2), pp. 157. DOI: 10.3724/sp.j.1260.2012.10022.

Author(s): Linyuan YANG, Hong LI, Xiaoqing ZHAO, Guoqing LIU

Keyword(s): Amino Acid, Phylogenetic Relationships, Amino Acid Sequences, Protein Coding, Coding Sequences, Intergenic Sequences

Delineation of the Crucial Evolutionary Amino Acid Sites in Trehalose-6-Phosphate Synthase From Higher Plants

Evolutionary Bioinformatics, 2020, Vol 16, pp. 117693432091014. DOI: 10.1177/1176934320910145.

Author(s): Rong Wang, Congfen He, Kun Dong, Xin Zhao, Yaxuan Li, ...

Keyword(s): Amino Acid, Positive Selection, Higher Plants, Functional Divergence, Acid Sites, Evolutionary Analysis, Codon Substitution, Bioinformatics Tools, Amino Acid Properties, Phosphate Synthase

Trehalose-6-phosphate synthase (TPS) is a key enzyme in the biosynthesis of trehalose, with its direct product, trehalose-6-phosphate, playing important roles in regulating whole-plant carbohydrate allocation and utilization. Genes encoding TPS constitute a multigene family in which functional divergence appears to have occurred repeatedly. To identify the crucial evolutionary amino acid sites of TPS in higher plants, a series of bioinformatics tools were applied to investigate the phylogenetic relationships, functional divergence, positive selection, and co-evolution of TPS proteins. First, we identified 150 TPS genes from 13 higher plant species. Phylogenetic analysis placed these TPS proteins into 2 clades: clades A and B, of which clade B could be further divided into 4 subclades (B1-B4). This classification was supported by the intron-exon structures, with more introns present in clade A. Next, detection of the critical functionally divergent amino acid sites resulted in the isolation of a total of 286 sites reflecting nonredundant radical shifts in amino acid properties with a high posterior probability cutoff among subclades. In addition, positively selected sites were identified using a codon substitution model, from which 46 amino acid sites were isolated as exhibiting positive selection at a significant level. Moreover, 18 amino acid sites were highlighted both for functional divergence and positive selection; these may thus potentially represent crucial evolutionary sites in the TPS family. Further co-evolutionary analysis revealed 3 pairs of sites: 11S and 12H, 33S and 34N, and 109G and 110E as demonstrating co-evolution. Finally, the 18 crucial evolutionary amino acid sites were mapped in the 3-dimensional structure. A total of 77 sites harboring functionally and structurally important residues of TPS proteins were found by using the CLIPS-4D online tool; notably, no overlap was observed with the identified crucial evolutionary sites, providing positive evidence supporting their designation. A total of 18 sites were isolated as key amino acids by using multiple bioinformatics tools based on their concomitant functional divergence and positive selection. Almost all these key sites are located in 2 domains of this protein family where they exhibit no overlap with the structurally and functionally conserved sites. These results will provide an improved understanding of the complexity of the TPS gene family and of its function and evolution in higher plants. Moreover, this knowledge may facilitate the exploitation of these sites for protein engineering applications.

Rapid protein sequence evolution via compensatory frameshift is widespread in RNA virus genomes

BMC Bioinformatics, 2021, Vol 22 (1). DOI: 10.1186/s12859-021-04182-9.

Author(s): Dongbin Park, Yoonsoo Hahn

Keyword(s): Amino Acid, Large Scale, Rna Viruses, Rna Virus, Phylogenetic Analyses, Sequence Evolution, Protein Coding, Coding Sequences, Reading Frame, Nucleotide Insertions

Background: RNA viruses possess remarkable evolutionary versatility driven by the high mutability of their genomes. Frameshifting nucleotide insertions or deletions (indels), which cause the premature termination of proteins, are frequently observed in the coding sequences of various viral genomes. When a secondary indel occurs near the primary indel site, the open reading frame can be restored to produce functional proteins, a phenomenon known as the compensatory frameshift. Results: In this study, we systematically analyzed publicly available viral genome sequences and identified compensatory frameshift events in hundreds of viral protein-coding sequences. Compensatory frameshift events resulted in large-scale amino acid differences between the compensatory frameshift form and the wild type even though their nucleotide sequences were almost identical. Phylogenetic analyses revealed that the evolutionary distances between proteins with and without a compensatory frameshift were significantly overestimated because amino acid mismatches caused by compensatory frameshifts were counted as substitutions. Further, this could cause compensatory frameshift forms to branch in different locations in the protein and nucleotide trees, which may obscure the correct interpretation of phylogenetic relationships between variant viruses. Conclusions: Our results imply that the compensatory frameshift is one of the mechanisms driving the rapid protein evolution of RNA viruses and potentially assisting their host-range expansion and adaptation.
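A toy example may help show why a compensatory frameshift produces many amino acid mismatches from only two nucleotide events. The sketch below is a minimal illustration, not the authors' pipeline: the 33-nucleotide sequence, the indel positions, and the helper names are all made up, and translation uses the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str) -> str:
    """Translate from position 0, stopping after the first stop codon.

    A trailing incomplete codon is ignored.
    """
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def delete_base(seq: str, pos: int) -> str:
    """Remove the nucleotide at index `pos`."""
    return seq[:pos] + seq[pos + 1:]


def insert_base(seq: str, pos: int, base: str) -> str:
    """Insert `base` immediately before index `pos`."""
    return seq[:pos] + base + seq[pos:]


# Hypothetical wild-type CDS: ATG GCT GAA AAG CTC ACC GGT TCA CGT GAT TAA
wild = "ATGGCTGAAAAGCTCACCGGTTCACGTGATTAA"

# Primary indel: a one-base deletion shifts the reading frame downstream.
shifted = delete_base(wild, 6)
# Secondary indel: a nearby one-base insertion restores the original frame.
restored = insert_base(shifted, 20, "C")

wild_protein = translate(wild)          # MAEKLTGSRD*
restored_protein = translate(restored)  # MAKSSPVSRD*

# Position-by-position amino acid mismatches between the two proteins.
mismatches = sum(a != b for a, b in zip(wild_protein, restored_protein))
```

In this example the two proteins have the same length and identical ends, but the five residues between the indels all differ. A method that scores each mismatch as an independent substitution would count five changes where only two indel events occurred, which is the overestimation effect described in the Results.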
