A coarse-graining, ultrametric approach to resolve the phylogeny of prokaryotic strains with frequent homologous recombination

2020 ◽  
Author(s):  
Tin Yau Pang

Abstract Background: A frequent event in the evolution of prokaryotic genomes is homologous recombination, where a foreign DNA stretch replaces a genomic region similar in sequence. Recombination can affect the relative position of two genomes in a phylogenetic reconstruction in two different ways: (i) one genome can recombine with a DNA stretch that is similar to the other genome, thereby reducing their pairwise sequence divergence; (ii) one genome can recombine with a DNA stretch from an outgroup genome, increasing the pairwise divergence. While several recombination-aware phylogenetic algorithms exist, many of these cannot account for both types of recombination; some algorithms can, but do so inefficiently. Moreover, many of them reconstruct the ancestral recombination graph (ARG) to help infer the genome tree, and require that a substantial portion of each genome has not been affected by recombination, a sometimes unrealistic assumption. Methods: Here, we propose a coarse-graining approach for phylogenetic reconstruction (CGP), which is recombination-aware but forgoes ARG reconstruction. It accounts for the tendency of a higher effective recombination rate between genomes with a lower phylogenetic distance. It is applicable even if all genomic regions have experienced substantial amounts of recombination, and can be used on both nucleotide and amino acid sequences. CGP considers the local density of substitutions along pairwise genome alignments, fitting a model to the empirical distribution of substitution density to infer the pairwise coalescent time. Given all pairwise coalescent times, CGP reconstructs an ultrametric tree representing vertical inheritance. Results: Based on simulations, we show that the proposed approach can reconstruct ultrametric trees with accurate topology, branch lengths, and root positioning. Applied to a set of E. coli strains, the reconstructed trees are most consistent with gene distributions when inferred from amino acid sequences, a data type that cannot be utilized by many alternative approaches. Conclusions: The CGP algorithm is more accurate than alternative recombination-aware methods for ultrametric phylogenetic reconstructions.
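The final step described in this abstract, turning pairwise coalescent times into an ultrametric tree, can be illustrated with a minimal sketch. The abstract does not say which clustering procedure CGP uses, so the average-linkage (UPGMA) rule below is an assumption chosen for illustration, and the function name `upgma`, the strain labels, and the times are all hypothetical.

```python
from itertools import combinations


def upgma(names, times):
    """Build an ultrametric tree from pairwise coalescent times.

    names: list of strain labels (hypothetical).
    times: dict mapping frozenset({a, b}) -> estimated age of the
           most recent common ancestor of a and b.
    Returns (newick_string, root_age).
    """
    # cluster id -> (newick text, node age, number of leaves)
    clusters = {n: (n, 0.0, 1) for n in names}
    dist = {frozenset(p): float(times[frozenset(p)])
            for p in combinations(names, 2)}
    counter = 0
    while len(clusters) > 1:
        # Merge the two clusters with the smallest mean coalescent time.
        pair = min(dist, key=dist.get)
        a, b = sorted(pair)
        # The input is already an age, so it is used directly as node height.
        height = dist[pair]
        nwk_a, h_a, n_a = clusters.pop(a)
        nwk_b, h_b, n_b = clusters.pop(b)
        new = f"#{counter}"
        counter += 1
        newick = f"({nwk_a}:{height - h_a:.3f},{nwk_b}:{height - h_b:.3f})"
        # Size-weighted average of the times to every remaining cluster.
        for c in clusters:
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            dist[frozenset((new, c))] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        del dist[pair]
        clusters[new] = (newick, height, n_a + n_b)
    newick, height, _ = next(iter(clusters.values()))
    return newick + ";", height


# Made-up coalescent times for four hypothetical strains.
example_times = {
    frozenset("AB"): 1.0,
    frozenset("AC"): 3.0, frozenset("BC"): 3.0,
    frozenset("AD"): 5.0, frozenset("BD"): 5.0, frozenset("CD"): 5.0,
}
newick, root_age = upgma(["A", "B", "C", "D"], example_times)
print(newick, root_age)
```

Every leaf in the returned tree sits at the same distance from the root, which is the defining property of an ultrametric tree; whether average linkage matches the behaviour of the published method is not something this listing confirms.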

2020 ◽  
Author(s):  
Tin Yau Pang

Abstract Background: A frequent event in the evolution of prokaryotic genomes is homologous recombination, where a foreign DNA stretch replaces a genomic region similar in sequence. Recombination can affect the relative position of two genomes in a phylogenetic reconstruction in two different ways: (i) one genome can recombine with a DNA stretch that is similar to the other genome, thereby reducing their pairwise sequence divergence; (ii) one genome can recombine with a DNA stretch from an outgroup genome, increasing the pairwise divergence. While several recombination-aware phylogenetic algorithms exist, many of these cannot account for both types of recombination; some algorithms can, but do so inefficiently. Moreover, many of them reconstruct the ancestral recombination graph (ARG) to help infer the genome tree, and require that a substantial portion of each genome has not been affected by recombination, a sometimes unrealistic assumption. Results: Here, we propose a coarse-graining approach for phylogenetic reconstruction (CGP), which is recombination-aware but forgoes ARG reconstruction. It accounts for the tendency of a higher effective recombination rate between genomes with a lower phylogenetic distance. It is applicable even if all genomic regions have experienced substantial amounts of recombination, and can be used on both nucleotide and amino acid sequences. CGP considers the local density of substitutions along pairwise genome alignments, fitting a model to the empirical distribution of substitution density to infer the pairwise coalescent time. Given all pairwise coalescent times, CGP reconstructs an ultrametric tree representing vertical inheritance. Based on simulations, we show that the proposed approach can reconstruct ultrametric trees with accurate topology, branch lengths, and root positioning. Applied to a set of E. coli strains, the reconstructed trees are most consistent with gene distributions when inferred from amino acid sequences, a data type that cannot be utilized by many alternative approaches. Conclusions: The CGP algorithm is more accurate than alternative recombination-aware methods for ultrametric phylogenetic reconstructions.


2019 ◽  
Author(s):  
Tin Yau Pang

Abstract Background: A frequent event in the evolution of prokaryotic genomes is homologous recombination, where a foreign DNA stretch replaces a genomic region similar in sequence. Recombination can affect the relative position of two genomes in a phylogenetic reconstruction in two different ways: (i) one genome can recombine with a DNA stretch that is similar to the other genome, thereby reducing their pairwise sequence divergence; (ii) one genome can recombine with a DNA stretch from an outgroup genome, increasing the pairwise divergence. While several recombination-aware phylogenetic algorithms exist, many of these cannot account for both types of recombination; some algorithms can, but do so inefficiently. Moreover, many of them reconstruct the ancestral recombination graph (ARG) to help infer the genome tree, and require that a substantial portion of each genome has not been affected by recombination, a sometimes unrealistic assumption. Results: Here, we propose a coarse-graining approach for phylogenetic reconstruction (CGP), which is recombination-aware but forgoes ARG reconstruction, applicable even if all genomic regions have experienced substantial amounts of recombination, and can be used on both nucleotide and amino acid sequences. CGP considers the local density of substitutions along pairwise genome alignments, fitting a model to the empirical distribution of substitution density to infer the pairwise coalescent time. Given all pairwise coalescent times, CGP reconstructs an ultrametric tree representing vertical inheritance. Based on simulations, we show that the proposed approach can reconstruct ultrametric trees with accurate topology, branch lengths, and root positioning. Applied to a set of E. coli strains, the reconstructed trees are most consistent with gene distributions when inferred from amino acid sequences, a data type that cannot be utilized by many alternative approaches. Conclusions: The CGP algorithm is more accurate than alternative recombination-aware methods for ultrametric phylogenetic reconstructions.


2016 ◽  
Author(s):  
Tin Yau Pang

ABSTRACT A frequent event in the evolution of prokaryotic genomes is homologous recombination, where a foreign DNA stretch replaces a genomic region similar in sequence. Recombination can affect the relative position of two genomes in a phylogenetic reconstruction in two different ways: (i) one genome can recombine with a DNA stretch that is similar to the other genome, thereby reducing their pairwise sequence divergence; (ii) one genome can recombine with a DNA stretch from an outgroup genome, increasing the pairwise divergence. While several recombination-aware phylogenetic algorithms exist, many of these cannot account for both types of recombination; some algorithms can, but do so inefficiently. Moreover, many existing algorithms require that a substantial portion of each genome has not been affected by recombination, a sometimes unrealistic assumption. Here, we propose a novel coarse-graining approach for phylogenetic reconstruction (CGP), which is recombination-aware, applicable even if all genomic regions have experienced substantial amounts of recombination, and can be used on both nucleotide and amino acid sequences. CGP considers the local density of substitutions along pairwise genome alignments, fitting a model to the empirical distribution of substitution density to infer the pairwise coalescent time. Given all pairwise coalescent times, CGP reconstructs an ultrametric tree representing vertical inheritance. Based on simulations, we show that the proposed approach can reconstruct ultrametric trees with accurate topology, branch lengths, and root positioning. Applied to a set of E. coli strains, the reconstructed trees are most consistent with gene distributions when inferred from amino acid sequences, a data type that cannot be utilized by many alternative approaches.
AUTHOR SUMMARY In homologous recombination, segments of foreign DNA overwrite similar segments of a prokaryotic genome. A single recombination event can simultaneously introduce many DNA substitutions. This disturbs phylogenetic signals, making it difficult to reconstruct prokaryotic family trees. While a handful of recombination-aware phylogenetic algorithms have been proposed, most do not take all effects of recombination into account; others rely on the frequently unrealistic assumption that a substantial part of a genome has not been affected by recombination at all. Here, we introduce a novel approach to phylogenetic reconstruction, which estimates the age of the most recent common ancestor of two strains from the density distribution of DNA or amino acid substitutions between their genomes. The proposed phylogenetic tree is the tree most compatible with these age estimates. Based on nucleotide or amino acid sequences, our approach accurately predicts the topology, branch lengths, and root positioning of prokaryotic family trees.
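The "density distribution of substitutions" that this summary refers to can be made concrete with a short sketch. It only tabulates the empirical distribution of mismatch counts in fixed-size windows of a pairwise alignment; the model that CGP fits to this distribution is not described in the abstract and is not attempted here. The function name, the window size of 10, and the toy sequences are assumptions for illustration.

```python
from collections import Counter


def substitution_density(seq_a, seq_b, window=10):
    """Count mismatches in consecutive, non-overlapping alignment windows.

    Gap columns ("-") are ignored. A trailing partial window is dropped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    counts = []
    for start in range(0, len(seq_a) - window + 1, window):
        columns = zip(seq_a[start:start + window], seq_b[start:start + window])
        counts.append(sum(x != y for x, y in columns
                          if x != "-" and y != "-"))
    return counts


# Toy 30-column alignment: seq_b differs from seq_a at columns 11 and 16.
seq_a = "ACGT" * 7 + "AC"
seq_b = seq_a[:11] + "A" + seq_a[12:16] + "T" + seq_a[17:]

counts = substitution_density(seq_a, seq_b, window=10)
histogram = Counter(counts)  # substitutions per window -> number of windows
print(counts, dict(histogram))
```

In this toy case both substitutions fall in the middle window, giving a clustered pattern of the kind that a single recombination event introducing several substitutions at once would produce.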


2018 ◽  
Vol 44 (1) ◽  
pp. 20
Author(s):  
Eloiza Teles Caldart ◽  
Helena Mata ◽  
Cláudio Wageck Canal ◽  
Ana Paula Ravazzolo

Background: Phylogenetic analyses are an essential part in the exploratory assessment of nucleic acid and amino acid sequences. Particularly in virology, they are able to delineate the evolution and epidemiology of disease etiologic agents and/or the evolutionary path of their hosts. The objective of this review is to help researchers who want to use phylogenetic analyses as a tool in virology and molecular epidemiology studies, presenting the most commonly used methodologies, describing the importance of the different techniques, their peculiar vocabulary and some examples of their use in virology. Review: This article starts by presenting basic concepts of molecular epidemiology and molecular evolution, emphasizing their relevance in the context of viral infectious diseases. It presents a section on the vocabulary relevant to the subject, bringing readers to a minimum level of knowledge needed throughout this literature review. Within its main subject, the text explains what a molecular phylogenetic analysis is, starting from a multiple alignment of nucleotide or amino acid sequences. The different software used to perform multiple alignments may apply different algorithms. To build a phylogeny based on amino acid or nucleotide sequences it is necessary to produce a data matrix based on a model for nucleotide or amino acid replacement, also called an evolutionary model. There are a number of evolutionary models available, varying in complexity according to the number of parameters (transition, transversion, GC content, nucleotide position in the codon, among others). Some papers presented herein provide techniques that can be used to choose evolutionary models. After the model is chosen, the next step is to opt for a phylogenetic reconstruction method that best fits the available data and the selected model. Here we present the most common reconstruction methods currently used, describing their principles, advantages and disadvantages. Distance methods, for example, are simpler and faster; however, they do not provide reliable estimations when the sequences are highly divergent. The accuracy of the analysis with probabilistic models (neighbour joining, maximum likelihood and Bayesian inference) strongly depends on the adherence of the actual data to the chosen evolutionary model. Finally, we also explore topology confidence tests, especially the most used one, the bootstrap. To assist the reader, this review presents figures to explain specific situations discussed in the text and numerous examples of previously published scientific articles in virology that demonstrate the importance of the techniques discussed herein, as well as their judicious use. Conclusion: The DNA sequence is not only a record of phylogeny and divergence times, but also keeps signs of how the evolutionary process has shaped its history and also the elapsed time in the evolutionary process of the population. Analyses of genomic sequences by molecular phylogeny have demonstrated a broad spectrum of applications. It is important to note that for the different available data and different purposes of phylogenies, reconstruction methods and evolutionary models should be wisely chosen. This review provides a theoretical basis for the choice of evolutionary models and phylogenetic reconstruction methods best suited to each situation. In addition, it presents examples of diverse applications of molecular phylogeny in virology.
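The remark that distance methods become unreliable for highly divergent sequences can be illustrated with the simplest evolutionary model. The sketch below uses the Jukes-Cantor (JC69) correction, which is chosen here for illustration and is not named in the abstract; the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap ("-") in either sequence are skipped.
    """
    sites = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in sites) / len(sites)


def jc69_distance(p):
    """Jukes-Cantor corrected distance in substitutions per site.

    The correction is undefined once the observed proportion of
    differences reaches 0.75, the value expected for unrelated
    nucleotide sequences, so infinity is returned in that case.
    """
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The corrected distance grows faster and faster as p approaches 0.75.
for p in (0.05, 0.30, 0.60, 0.70, 0.74):
    print(f"p = {p:.2f}  ->  d = {jc69_distance(p):.3f}")
```

Near saturation, a small sampling error in the observed proportion translates into a large change in the corrected distance, which is one way to see why distance estimates degrade for highly divergent sequences.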


1987 ◽  
Vol 7 (6) ◽  
pp. 2231-2242 ◽  
Author(s):  
J E Rudolph ◽  
M Kimble ◽  
H D Hoyle ◽  
M A Subler ◽  
E C Raff

The genomic DNA sequence and deduced amino acid sequence are presented for three Drosophila melanogaster beta-tubulins: a developmentally regulated isoform beta 3-tubulin, the wild-type testis-specific isoform beta 2-tubulin, and an ethyl methanesulfonate-induced assembly-defective mutation of the testis isoform, B2t8. The testis-specific beta 2-tubulin is highly homologous to the major vertebrate beta-tubulins, but beta 3-tubulin is considerably diverged. Comparison of the amino acid sequences of the two Drosophila isoforms to those of other beta-tubulins indicates that these two proteins are representative of an ancient sequence divergence event which at least preceded the split between lines leading to vertebrates and invertebrates. The intron/exon structures of the genes for beta 2- and beta 3-tubulin are not the same. The structure of the gene for the variant beta 3-tubulin isoform, but not that of the testis-specific beta 2-tubulin gene, is similar to that of vertebrate beta-tubulins. The mutation B2t8 in the gene for the testis-specific beta 2-tubulin defines a single amino acid residue required for normal assembly function of beta-tubulin. The sequence of the B2t8 gene is identical to that of the wild-type gene except for a single nucleotide change resulting in the substitution of lysine for glutamic acid at residue 288. This position falls at the junction between two major structural domains of the beta-tubulin molecule. Although this hinge region is relatively variable in sequence among different beta-tubulins, the residue corresponding to glu 288 of Drosophila beta 2-tubulin is highly conserved as an acidic amino acid not only in all other beta-tubulins but in alpha-tubulins as well.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3391 ◽  
Author(s):  
Dariya K. Sydykova ◽  
Claus O. Wilke

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN∕dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN∕dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN∕dS, using either dN∕dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN∕dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN∕dS, and the correlation strengths increase in alignments with greater sequence divergence and more taxa. Moreover, Rate4Site scores correlate very well with inferred (as opposed to true) dN∕dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN∕dS in a variety of empirical datasets. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield very similar inferences.


Genome ◽  
1988 ◽  
Vol 30 (3) ◽  
pp. 341-346
Author(s):  
G. Brian Golding

The divergence of immunoglobulin genes due to somatic mutation provides a natural example of DNA sequence divergence. This divergence was examined to gain insight into the processes of evolution and the determinants of the variance-to-mean ratio of sequence divergence. Normally, this ratio is found to be larger than expected (1.0 under Poisson assumptions) for the evolutionary divergence of most genes. Although not significantly less than one, all seven groups of immunoglobulin amino acid sequences have ratios smaller than expected, contrary to the evolutionary pattern generally observed. The substitutions in the immunoglobulin genes appear to be highly nonrandom and an excess of parallel changes (the major nonrandom feature of these mutations) is shown to cause smaller ratios. Because convergent or parallel mutations are often observed in the evolutionary divergence of genes, this suggests that forces causing the large observed ratios may actually have to be more powerful than previously expected. Further, since selection is one of the likely causes of parallel mutations, it should be noted that selection could significantly decrease the variance-to-mean ratio. The high frequency of parallel mutations and their resulting effects, as observed in the immunoglobulin genes, suggest that only poor inferences of sequence divergence can be made without actual knowledge of the ancestral sequence. Key words: molecular evolution, parallel mutations, neutral allele theory, sequence divergence, immunoglobulin, somatic mutations.
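The variance-to-mean ratio discussed in this abstract is simple to compute. The sketch below uses the sample variance and entirely made-up substitution counts; it shows the arithmetic only and does not reproduce any figure from the paper. The function name is hypothetical.

```python
from statistics import mean, variance


def variance_to_mean_ratio(counts):
    """Sample variance divided by the mean of substitution counts.

    A Poisson process is expected to give a ratio close to 1.0.
    """
    return variance(counts) / mean(counts)


# Made-up numbers of substitutions per lineage, for illustration only.
underdispersed = [4, 5, 6, 5, 5]   # counts cluster tightly around the mean
overdispersed = [1, 9, 2, 8, 5]    # counts vary more than Poisson predicts

ratio_low = variance_to_mean_ratio(underdispersed)
ratio_high = variance_to_mean_ratio(overdispersed)
print(ratio_low, ratio_high)
```

Both toy data sets have a mean of 5 substitutions, so the difference between the two ratios comes entirely from the spread of the counts.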


2016 ◽  
Author(s):  
Galya V. Klink ◽  
Georgii A. Bazykin

Abstract Amino acid propensities at amino acid sites change with time due to epistatic interactions or changing environment, affecting the probabilities of fixation of different amino acids. Such changes should lead to an increased rate of homoplasies (reversals, parallelisms, and convergences) in closely related species. Here, we reconstruct the phylogeny of twelve mitochondrial proteins from several thousand metazoan species, and measure the phylogenetic distances between branches at which either the same allele originated repeatedly due to homoplasies, or different alleles originated due to divergent substitutions. The mean phylogenetic distance between parallel substitutions is ∼20% lower than the mean phylogenetic distance between divergent substitutions, indicating that a variant fixed in a species is more likely to be deleterious in a more phylogenetically remote species, compared to a more closely related species. These findings are robust to artefacts of phylogenetic reconstruction or of pooling of sites from different conservation classes or functional groups, and imply that single-position fitness landscapes change at rates similar to rates of amino acid changes.


2017 ◽  
Author(s):  
Dariya K. Sydykova ◽  
Claus O. Wilke

Site-specific evolutionary rates can be estimated from codon sequences or from amino-acid sequences. For codon sequences, the most popular methods use some variation of the dN/dS ratio. For amino-acid sequences, one widely-used method is called Rate4Site, and it assigns a relative conservation score to each site in an alignment. How site-wise dN/dS values relate to Rate4Site scores is not known. Here we elucidate the relationship between these two rate measurements. We simulate sequences with known dN/dS, using either dN/dS models or mutation–selection models for simulation. We then infer Rate4Site scores on the simulated alignments, and we compare those scores to either true or inferred dN/dS values on the same alignments. We find that Rate4Site scores generally correlate well with true dN/dS, and the correlation strengths increase in alignments with higher sequence divergence and a higher number of taxa. Moreover, Rate4Site scores correlate nearly perfectly with inferred dN/dS values, even for small alignments with little divergence. Finally, we verify this relationship between Rate4Site and dN/dS in a variety of natural sequence alignments. We conclude that codon-level and amino-acid-level analysis frameworks are directly comparable and yield near-identical inferences.

