Theory of measurement for site-specific evolutionary rates in amino-acid sequences

10.7287/peerj.preprints.3002v1 ◽

2017 ◽

Author(s):

Dariya K Sydykova ◽

Claus O Wilke

Keyword(s):

Amino Acid ◽

Fitness Landscape ◽

Protein Complexes ◽

Theory Of Measurement ◽

Amino Acid Sequences ◽

Selection Model ◽

Evolutionary Rates ◽

Site Specific ◽

The Matrix ◽

Tree Inference

Many applications require the calculation of site-specific evolutionary rates from alignments of amino-acid sequences. For example, catalytic residues in enzymes and interface regions in protein complexes can be inferred from observed relative rates. While numerous approaches exist to calculate amino-acid rates, it is not entirely clear what physical quantities the inferred rates represent and how these rates relate to the underlying fitness landscape of the evolving protein. Further, amino-acid rates can be calculated in the context of different amino-acid exchangeability matrices, such as JTT, LG, or WAG, and again it is not known how the choice of the matrix influences the physical interpretation of the inferred rates. Here, we develop a theory of measurement for site-specific evolutionary rates, by analytically solving the maximum-likelihood equations for rate inference performed on sequences evolved under a mutation–selection model. We demonstrate that the measurement process can only recover the true expected rates of the mutation–selection model if rates are measured relative to a naïve exchangeability matrix, in which all exchangeabilities are equal to one. Rate measurements using other matrices are quantitatively close but not mathematically correct. Our results demonstrate that insights obtained from phylogenetic-tree inference do not necessarily apply to rate inference, and best practices for the former may be deleterious for the latter.

Theory of measurement for site-specific evolutionary rates in amino-acid sequences

10.1101/411025 ◽

2018 ◽

Cited By ~ 1

Author(s):

Dariya K. Sydykova ◽

Claus O. Wilke

Keyword(s):

Amino Acid ◽

Maximum Likelihood ◽

Fitness Landscape ◽

Theory Of Measurement ◽

Amino Acid Sequences ◽

Selection Model ◽

Evolutionary Rates ◽

Likelihood Inference ◽

Model Parameters ◽

Site Specific

In the field of molecular evolution, we commonly calculate site-specific evolutionary rates from alignments of amino-acid sequences. For example, catalytic residues in enzymes and interface regions in protein complexes can be inferred from observed relative rates. While numerous approaches exist to calculate amino-acid rates, it is not entirely clear what physical quantities the inferred rates represent and how these rates relate to the underlying fitness landscape of the evolving proteins. Further, amino-acid rates can be calculated in the context of different amino-acid exchangeability matrices, such as JTT, LG, or WAG, and again it is not well understood how the choice of the matrix influences the physical interpretation of the inferred rates. Here, we develop a theory of measurement for site-specific evolutionary rates, by analytically solving the maximum-likelihood equations for rate inference performed on sequences evolved under a mutation–selection model. We demonstrate that for realistic analysis settings the measurement process will recover the true expected rates of the mutation–selection model if rates are measured relative to a naïve exchangeability matrix, in which all exchangeabilities are equal to 1/19. We also show that rate measurements using other matrices are quantitatively close but in general not mathematically equivalent. Our results demonstrate that insights obtained from phylogenetic-tree inference do not necessarily apply to rate inference, and best practices for the former may be deleterious for the latter.

Significance Statement:

Maximum likelihood inference is widely used to infer model parameters from sequence data in an evolutionary context. One major challenge in such inference procedures is the problem of having to identify the appropriate model used for inference. Model parameters usually are meaningful only to the extent that the model is appropriately specified and matches the process that generated the data. However, in practice, we don’t know what process generated the data, and most models in actual use are misspecified. To circumvent this problem, we show here that we can employ maximum likelihood inference to make defined and meaningful measurements on arbitrary processes. Our approach uses misspecification as a deliberate strategy, and this strategy results in robust and meaningful parameter inference.
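
A small sketch may help show why 1/19 is a natural value for the naïve matrix mentioned in the abstract: with 20 amino acids, each state has 19 possible replacements, so off-diagonal entries of 1/19 give a total rate of leaving any state equal to 1. The code below assumes the naïve matrix is used directly as a rate matrix with uniform equilibrium frequencies. That reading, and every function and variable name, is an assumption made for illustration and is not taken from the paper.

```python
# Illustrative sketch only; names and the uniform-frequency assumption
# are hypothetical and not taken from the paper.
N_STATES = 20  # number of amino acids


def naive_rate_matrix(n=N_STATES):
    """Build an n x n rate matrix whose off-diagonal entries all equal 1/(n-1).

    The diagonal is set to -1 so that every row sums to zero.
    """
    off_diagonal = 1.0 / (n - 1)
    return [[-1.0 if i == j else off_diagonal for j in range(n)]
            for i in range(n)]


def expected_rate(q, freqs):
    """Expected substitutions per unit time: sum_i pi_i * sum_{j != i} q_ij."""
    n = len(q)
    return sum(freqs[i] * sum(q[i][j] for j in range(n) if j != i)
               for i in range(n))


Q = naive_rate_matrix()
uniform = [1.0 / N_STATES] * N_STATES  # assumed equilibrium frequencies
rate = expected_rate(Q, uniform)
print(f"expected rate under the naive matrix: {rate:.6f}")
```

Under these assumptions the expected rate is normalized to one substitution per unit time, so a site-specific rate inferred relative to this matrix can be read as a multiple of that baseline.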

Calculating site-specific evolutionary rates at the amino-acid or codon level yields similar rate estimates

PeerJ ◽

10.7717/peerj.3391 ◽

2017 ◽

Vol 5 ◽

pp. e3391 ◽

Cited By ~ 6

Author(s):

Dariya K. Sydykova ◽

Claus O. Wilke

Keyword(s):

Amino Acid ◽

Conservation Score ◽

Amino Acid Level ◽

Sequence Divergence ◽

Similar Rate ◽

Amino Acid Sequences ◽

Evolutionary Rates ◽

Site Specific ◽

Relative Conservation ◽

The Relationship

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN∕dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN∕dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN∕dS, using either dN∕dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN∕dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN∕dS, and the correlation strengths increase in alignments with greater sequence divergence and more taxa. Moreover, Rate4Site scores correlate very well with inferred (as opposed to true) dN∕dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN∕dS in a variety of empirical datasets. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield very similar inferences.
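
The comparison described in this abstract amounts to correlating two per-site vectors: one of dN∕dS values and one of Rate4Site scores. The sketch below shows one way to do that with a rank (Spearman) correlation in plain Python. The abstract does not say which correlation measure was used, so the choice of Spearman is an assumption, and the numbers are invented toy values rather than results from the paper.

```python
# Illustrative sketch only; Spearman correlation is an assumed choice,
# and the toy values below are invented for demonstration.

def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


toy_dnds = [0.05, 0.20, 0.45, 0.80, 1.10]    # hypothetical per-site dN/dS
toy_scores = [-1.2, -0.6, 0.1, 0.4, 1.3]     # hypothetical Rate4Site scores
rho = spearman(toy_dnds, toy_scores)
print(f"Spearman rho = {rho:.3f}")
```

A rank correlation is a reasonable default here because the two measures live on different scales, and only their ordering of sites needs to agree.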

Calculating site-specific evolutionary rates at the amino-acid or codon level yields similar rate estimates

10.7287/peerj.preprints.2739v1 ◽

2017 ◽

Author(s):

Dariya K. Sydykova ◽

Claus O Wilke

Keyword(s):

Amino Acid ◽

Conservation Score ◽

Amino Acid Level ◽

Sequence Divergence ◽

Similar Rate ◽

Amino Acid Sequences ◽

Evolutionary Rates ◽

Sequence Alignments ◽

Site Specific ◽

Relative Conservation

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN/dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN/dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN/dS, using either dN/dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN/dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN/dS, and the correlation strengths increase in alignments with higher sequence divergence and a higher number of taxa. Moreover, Rate4Site scores correlate nearly perfectly with inferred dN/dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN/dS in a variety of natural sequence alignments. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield near-identical inferences.

Calculating site-specific evolutionary rates at the amino-acid or codon level yields similar rate estimates

10.7287/peerj.preprints.2739 ◽

2017 ◽

Author(s):

Dariya K. Sydykova ◽

Claus O Wilke

Keyword(s):

Amino Acid ◽

Conservation Score ◽

Amino Acid Level ◽

Sequence Divergence ◽

Similar Rate ◽

Amino Acid Sequences ◽

Evolutionary Rates ◽

Sequence Alignments ◽

Site Specific ◽

Relative Conservation

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN/dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN/dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN/dS, using either dN/dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN/dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN/dS, and the correlation strengths increase in alignments with higher sequence divergence and a higher number of taxa. Moreover, Rate4Site scores correlate nearly perfectly with inferred dN/dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN/dS in a variety of natural sequence alignments. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield near-identical inferences.

Activation domains of gene-specific transcription factors: are histones among their targets?

Biochemistry and Cell Biology ◽

10.1139/o04-036 ◽

2004 ◽

Vol 82 (4) ◽

pp. 453-459 ◽

Cited By ~ 5

Author(s):

Alexandre M Erkine

Keyword(s):

Transcription Factors ◽

Amino Acid ◽

Chromatin Remodeling ◽

Transcription Initiation ◽

Protein Complexes ◽

Amino Acid Sequences ◽

Gene Promoters ◽

Activation Domain ◽

Activation Domains ◽

Multiple Protein

Activation domains of promoter-specific transcription factors are critical entities involved in recruitment of multiple protein complexes to gene promoters. The activation domains often retain functionality when transferred between very diverse eukaryotic phyla, yet the amino acid sequences of activation domains do not bear any specific consensus or secondary structure. Activation domains function in the context of chromatin structure and are critical for chromatin remodeling, which is associated with transcription initiation. The mechanisms of direct and indirect recruitment of chromatin-remodeling and histone-modifying complexes, including mechanisms involving direct interactions between activation domains and histones, are discussed.

Key words: activation domain, transcription, chromatin, nucleosome.

Characterization of a Bacteroides Mobilizable Transposon, NBU2, Which Carries a Functional Lincomycin Resistance Gene

Journal of Bacteriology ◽

10.1128/jb.182.12.3559-3571.2000 ◽

2000 ◽

Vol 182 (12) ◽

pp. 3559-3571 ◽

Cited By ~ 55

Author(s):

Jun Wang ◽

Nadja B. Shoemaker ◽

Gui-Rong Wang ◽

Abigail A. Salyers

Keyword(s):

Amino Acid ◽

Resistance Gene ◽

Resistance Genes ◽

Sequence Similarity ◽

Antibiotic Resistance Genes ◽

Small Region ◽

Amino Acid Sequences ◽

Integrase Gene ◽

Site Specific ◽

Lincomycin Resistance

The mobilizable Bacteroides element NBU2 (11 kbp) was found originally in two Bacteroides clinical isolates, Bacteroides fragilis ERL and B. thetaiotaomicron DOT. At first, NBU2 appeared to be very similar to another mobilizable Bacteroides element, NBU1, in a 2.5-kbp internal region, but further examination of the full DNA sequence of NBU2 now reveals that the region of near identity between NBU1 and NBU2 is limited to this small region and that, outside this region, there is little sequence similarity between the two elements. The integrase gene of NBU2, intN2, was located at one end of the element. This gene was necessary and sufficient for the integration of NBU2. The integrase of NBU2 has the conserved amino acids (R-H-R-Y) in the C-terminal end that are found in members of the lambda family of site-specific integrases. This was also the only region in which the NBU1 and NBU2 integrases shared any similarity (28% amino acid sequence identity and 49% sequence similarity). Integration of NBU2 was site specific in Bacteroides species. Integration occurred in two primary sites in B. thetaiotaomicron. Both of these sites were located in the 3′ end of a serine-tRNA gene. NBU2 also integrated in Escherichia coli, but integration was much less site specific than in B. thetaiotaomicron. Analysis of the sequence of NBU2 revealed two potential antibiotic resistance genes. The amino acid sequences of the putative proteins encoded by these genes had similarity to resistances found in gram-positive bacteria. Only one of these genes was expressed in B. thetaiotaomicron, the homolog of linA, a lincomycin resistance gene from Staphylococcus aureus. To determine how widespread elements related to NBU1 and NBU2 are in Bacteroides species, we screened 291 Bacteroides strains. Elements with some sequence similarity to NBU2 and NBU1 were widespread in Bacteroides strains, and the presence of linAN in Bacteroides strains was highly correlated with the presence of NBU2, suggesting that NBU2 has been responsible for the spread of this gene among Bacteroides strains. Our results suggest that the NBU-related elements form a large and heterogeneous family, whose members have similar integration mechanisms but have different target sites and differ in whether they carry resistance genes.

Nucleotide and deduced amino acid sequences of the matrix (M) and fusion (F) protein genes of cetacean morbilliviruses isolated from a porpoise and a dolphin

Virus Research ◽

10.1016/0168-1702(94)90129-5 ◽

1994 ◽

Vol 34 (3) ◽

pp. 291-304 ◽

Cited By ~ 27

Author(s):

Gert Bolt ◽

Merete Blixenkrone-Møller ◽

Elisabeth Gottschalck ◽

Richard G.A. Wishaupt ◽

Mark J. Welsh ◽

...

Keyword(s):

Amino Acid ◽

Amino Acid Sequences ◽

F Protein ◽

The Matrix

Y-box proteins combine versatile cold shock domains and arginine-rich motifs (ARMs) for pleiotropic functions in RNA biology

Biochemical Journal ◽

10.1042/bcj20170956 ◽

2018 ◽

Vol 475 (17) ◽

pp. 2769-2784 ◽

Cited By ~ 7

Author(s):

Kenneth C. Kleene

Keyword(s):

Amino Acid ◽

Cold Shock ◽

Rna Binding ◽

Rna Binding Proteins ◽

Specific Binding ◽

Amino Acid Sequences ◽

Site Specific ◽

Terminal Domain ◽

Rna Backbone ◽

Specific Regulation

Y-box proteins are single-strand DNA- and RNA-binding proteins distinguished by a conserved cold shock domain (CSD) and a variable C-terminal domain organized into alternating short modules rich in basic or acidic amino acids. A huge literature depicts Y-box proteins as highly abundant, staggeringly versatile proteins that interact with all mRNAs and function in most forms of mRNA-specific regulation. The mechanisms by which Y-box proteins recognize mRNAs are unclear, because their CSDs bind a jumble of diverse elements, and the basic modules in the C-terminal domain are considered to bind nonspecifically to phosphates in the RNA backbone. A survey of vertebrate Y-box proteins clarifies the confusing names for Y-box proteins, their domains, and RNA-binding motifs, and identifies several novel conserved sequences: first, the CSD is flanked by linkers that extend its binding surface or regulate co-operative binding of the CSD and N-terminal and C-terminal domains to proteins and RNA. Second, the basic modules in the C-terminal domain are bona fide arginine-rich motifs (ARMs), because arginine is the predominant amino acid and comprises 99% of basic residues. Third, conserved differences in AA (amino acid) sequences between isoforms probably affect RNA-binding specificity. C-terminal ARMs connect with many studies, demonstrating that ARMs avidly bind sites containing specific RNA structures. ARMs crystallize insights into the under-appreciated contributions of the C-terminal domain to site-specific binding by Y-box proteins and difficulties in identifying site-specific binding by the C-terminal domain. Validated structural biology techniques are available to elucidate the mechanisms by which YBXprot (Y-box element-binding protein) CSDs and ARMs identify targets.

Large-Scale Analyses of Site-Specific Evolutionary Rates across Eukaryote Proteomes Reveal Confounding Interactions between Intrinsic Disorder, Secondary Structure, and Functional Domains

Genes ◽

10.3390/genes9110553 ◽

2018 ◽

Vol 9 (11) ◽

pp. 553 ◽

Cited By ~ 7

Author(s):

Joseph Ahrens ◽

Jordon Rahaman ◽

Jessica Siltberg-Liberles

Keyword(s):

Amino Acid ◽

Secondary Structure ◽

Large Scale ◽

Intrinsic Disorder ◽

Secondary Structures ◽

Amino Acid Replacement ◽

Evolutionary Rates ◽

Functional Domains ◽

Site Specific ◽

Statistical Trends

Various structural and functional constraints govern the evolution of protein sequences. As a result, the relative rates of amino acid replacement among sites within a protein can vary significantly. Previous large-scale work on Metazoan (Animal) protein sequence alignments indicated that amino acid replacement rates are partially driven by a complex interaction among three factors: intrinsic disorder propensity; secondary structure; and functional domain involvement. Here, we use sequence-based predictors to evaluate the effects of these factors on site-specific sequence evolutionary rates within four eukaryotic lineages: Metazoans; Plants; Saccharomycete Fungi; and Alveolate Protists. Our results show broad, consistent trends across all four Eukaryote groups. In all four lineages, there is a significant increase in amino acid replacement rates when comparing: (i) disordered vs. ordered sites; (ii) random coil sites vs. sites in secondary structures; and (iii) inter-domain linker sites vs. sites in functional domains. Additionally, within Metazoans, Plants, and Saccharomycetes, there is a strong confounding interaction between intrinsic disorder and secondary structure—alignment sites exhibiting both high disorder propensity and involvement in secondary structures have very low average rates of sequence evolution. Analysis of gene ontology (GO) terms revealed that in all four lineages, a high fraction of sequences containing these conserved, disordered-structured sites are involved in nucleic acid binding. We also observe notable differences in the statistical trends of Alveolates, where intrinsically disordered sites are more variable than in other Eukaryotes and the statistical interactions between disorder and other factors are less pronounced.
