Consequences of Recombination on Traditional Phylogenetic Analysis

Genetics ◽  
2000 ◽  
Vol 156 (2) ◽  
pp. 879-891 ◽  
Author(s):  
Mikkel H Schierup ◽  
Jotun Hein

Abstract We investigate the shape of a phylogenetic tree reconstructed from sequences evolving under the coalescent with recombination. The motivation is that evolutionary inferences are often made from phylogenetic trees reconstructed from population data even though recombination may well occur (mtDNA or viral sequences) or does occur (nuclear sequences). We investigate the size and direction of biases when a single tree is reconstructed ignoring recombination. Standard software (PHYLIP) was used to construct the best phylogenetic tree from sequences simulated under the coalescent with recombination. With recombination present, the length of terminal branches and the total branch length are larger, and the time to the most recent common ancestor smaller, than for a tree reconstructed from sequences evolving with no recombination. The effects are pronounced even for small levels of recombination that may not be immediately detectable in a data set. The phylogenies when recombination is present superficially resemble phylogenies for sequences from an exponentially growing population. However, exponential growth has a different effect on statistics such as Tajima's D. Furthermore, ignoring recombination leads to a large overestimation of the substitution rate heterogeneity and the loss of the molecular clock. These results are discussed in relation to viral and mtDNA data sets.
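The abstract uses Tajima's D to separate the effect of recombination from that of exponential growth. As a minimal sketch of how that statistic is usually computed (the standard textbook formula, not code from the paper), the function below takes a list of aligned, equal-length sequences. The function name `tajimas_d` and the toy alignments are illustrative assumptions, and gaps or missing data are not handled.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(sequences):
    """Tajima's D for aligned, equal-length sequences given as strings."""
    n = len(sequences)
    if n < 4:
        # For n < 4 the variance term is zero, so D is undefined.
        raise ValueError("need at least four sequences")

    # S: number of segregating (polymorphic) alignment columns.
    s = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    if s == 0:
        raise ValueError("no segregating sites; D is undefined")

    # pi: mean number of pairwise differences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)

    # Constants that depend only on the sample size n.
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)

    # Difference between the two diversity estimators, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))


# Toy alignments (hypothetical data): only singletons vs. only 2/2 splits.
singletons = ["TAA", "ATA", "AAT", "AAA"]
balanced = ["TTT", "TTT", "AAA", "AAA"]
```

An excess of rare variants, as in `singletons`, pushes D below zero, while an excess of intermediate-frequency variants, as in `balanced`, pushes it above zero.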


mBio ◽  
2016 ◽  
Vol 7 (3) ◽  
Author(s):  
Xavier Didelot ◽  
Janina Dordel ◽  
Lilith K. Whittles ◽  
Caitlin Collins ◽  
Nicole Bilek ◽  
...  

ABSTRACT Gonorrhea is a sexually transmitted disease causing growing concern, with a substantial increase in reported incidence over the past few years in the United Kingdom and rising levels of resistance to a wide range of antibiotics. Understanding its epidemiology is therefore of major biomedical importance, not only on a population scale but also at the level of direct transmission. However, the molecular typing techniques traditionally used for gonorrhea infections do not provide sufficient resolution to investigate such fine-scale patterns. Here we sequenced the genomes of 237 isolates from two local collections of isolates from Sheffield and London, each of which was resolved into a single type using traditional methods. The two data sets were selected to have different epidemiological properties: the Sheffield data were collected over 6 years from a predominantly heterosexual population, whereas the London data were gathered within half a year and strongly associated with men who have sex with men. Based on contact tracing information between individuals in Sheffield, we found that transmission is associated with a median time to most recent common ancestor of 3.4 months, with an upper bound of 8 months, which we used as a criterion to identify likely transmission links in both data sets. In London, we found that transmission happened predominantly between individuals of similar age, sexual orientation, and location and also with the same HIV serostatus, which may reflect serosorting and associated risk behaviors. Comparison of the two data sets suggests that the London epidemic involved about ten times more cases than the Sheffield outbreak. IMPORTANCE The recent increases in gonorrhea incidence and antibiotic resistance are cause for public health concern. Successful intervention requires a better understanding of transmission patterns, which is not uncovered by traditional molecular epidemiology techniques. 
Here we studied two outbreaks that took place in Sheffield and London, United Kingdom. We show that whole-genome sequencing provides the resolution to investigate direct gonorrhea transmission between infected individuals. Combining genome sequencing with rich epidemiological information about infected individuals reveals the importance of several transmission routes and risk factors, which can be used to design better control measures.



Paleobiology ◽  
1997 ◽  
Vol 23 (1) ◽  
pp. 1-19 ◽  
Author(s):  
William C. Clyde ◽  
Daniel C. Fisher

Stratigraphic data are compared to morphologic data in terms of their fit to phylogenetic hypotheses for 29 data sets taken from the literature. Stratigraphic fit is measured using MacClade's stratigraphic character, which tracks the number of independent discrepancies between observed order and the order of occurrence that would be expected on the basis of a given phylogenetic hypothesis. Acceptance of a phylogenetic hypothesis despite such discrepancies requires ad hoc hypotheses concerning differential probabilities of preservation and recovery. These stratigraphic ad hoc hypotheses are treated as logically equivalent to morphologic ad hoc hypotheses of homoplasy. The retention index is used to compare the number of stratigraphic and morphologic ad hoc hypotheses required by given phylogenetic hypotheses. Each data set is subjected to five analyses, varying in the constraints imposed on the structure of the phylogenetic tree against which fit is measured. Analyses 1–4 compare the stratigraphic and morphologic retention indices using phylogenetic trees consistent with the morphologically most-parsimonious cladogram reported in the original study. Analysis 5 compares retention indices using the overall (stratigraphically and morphologically) most-parsimonious phylogenetic tree, which may be, but is not necessarily, consistent with the reported cladogram. Proceeding from Analysis 1 to Analysis 5, stratigraphic data are allowed greater influence in determining the structure of phylogenetic trees, with the trees in Analysis 1 derived without reference to the stratigraphic character and the trees in Analysis 5 derived from full interaction of stratigraphic and morphologic characters. Morphologic and stratigraphic retention indices for these 29 studies cannot be statistically distinguished in comparisons 3–5, suggesting very similar degrees of fit. 
The values of these retention indices are high, indicating a generally high level of congruence under these phylogenetic hypotheses. Significant gains (49%) in stratigraphic fit can be realized without significant loss (4%) in morphologic fit as the stratigraphic and morphologic evidence are both allowed to participate in constraining the structure of phylogenetic hypotheses. These results suggest that arguments based on alleged “noisiness” of stratigraphic data offer inadequate grounds for ignoring stratigraphic order in phylogenetic analysis. In terms of congruence, stratigraphic and morphologic data perform about equally well.
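For readers unfamiliar with the retention index, the sketch below assumes the standard definition: the fraction of the maximum possible extra steps that a tree does not require. The function and parameter names are illustrative. How the study tallies stratigraphic steps with MacClade's stratigraphic character is not shown here.

```python
def retention_index(min_steps, observed_steps, max_steps):
    """Retention index: (max - observed) / (max - min).

    min_steps      fewest steps the data could require on any tree
    observed_steps steps required on the tree being evaluated
    max_steps      most steps the data could require on any tree
    """
    if max_steps == min_steps:
        raise ValueError("undefined when max_steps equals min_steps")
    return (max_steps - observed_steps) / (max_steps - min_steps)
```

A value of 1 means the tree requires no ad hoc hypotheses beyond the minimum, and a value of 0 means it requires as many as any tree could.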



Author(s):  
Ben Bettisworth ◽  
Alexandros Stamatakis

Abstract Summary: In phylogenetic analysis, it is common to infer unrooted trees. Thus, it is unknown which node is the most recent common ancestor of all the taxa in the phylogeny. However, knowing the root location is desirable for downstream analyses and interpretation. There exist several methods to recover a root, such as midpoint rooting or rooting the tree at an outgroup. Non-reversible Markov models can also be used to compute the likelihood of a potential root position. We present RootDigger, a software tool that uses a non-reversible Markov model to compute the most likely root location on a given tree and to infer a confidence value for each possible root placement. Availability and implementation: RootDigger is available under the MIT licence at https://github.com/computations/root_digger
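Of the rooting methods mentioned above, midpoint rooting is the simplest to illustrate. The sketch below is not RootDigger's likelihood-based approach. It assumes an unrooted tree stored as a nested dictionary of branch lengths (a hypothetical representation), finds the two most distant tips, and reports the branch and offset where the midpoint of that path falls.

```python
from itertools import combinations


def path_between(tree, start, goal):
    """Return the nodes on the unique path from start to goal in a tree."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for neighbour in tree[node]:
            # In a tree it is enough to avoid stepping back to the parent.
            if len(path) < 2 or neighbour != path[-2]:
                stack.append((neighbour, path + [neighbour]))
    raise ValueError("nodes are not connected")


def midpoint_root(tree):
    """Return (u, v, offset): the root lies on branch u-v, offset from u."""
    tips = [node for node, nbrs in tree.items() if len(nbrs) == 1]
    best_length, best_path = -1.0, None
    for a, b in combinations(tips, 2):
        path = path_between(tree, a, b)
        length = sum(tree[u][v] for u, v in zip(path, path[1:]))
        if length > best_length:
            best_length, best_path = length, path

    # Walk along the longest tip-to-tip path until half its length is reached.
    half = best_length / 2
    walked = 0.0
    for u, v in zip(best_path, best_path[1:]):
        if walked + tree[u][v] >= half:
            return u, v, half - walked
        walked += tree[u][v]


# Hypothetical three-tip tree with one internal node X.
example_tree = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"X": 2.0},
    "X": {"A": 1.0, "B": 3.0, "C": 2.0},
}
```

In `example_tree` the longest tip-to-tip path runs from B to C with length 5.0, so the midpoint falls on branch B-X, 2.5 units from B. The pairwise search is quadratic in the number of tips, which is fine for a sketch but not for large trees.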



2020 ◽  
Author(s):  
Marika Kaden ◽  
Katrin Sophie Bohnsack ◽  
Mirko Weber ◽  
Mateusz Kudła ◽  
Kaja Gutowska ◽  
...  

Abstract We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations using phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. Those vector quantizers belong to the class of interpretable machine learning methods, which, on the one hand, provide additional knowledge about the classification decisions, such as discriminant feature correlations, and, on the other hand, can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e., the model refuses to make a classification decision if the model evidence for the presented data is insufficient. After training such a classifier for the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types and, on the other hand, the rejected sequences allow speculations about new virus types with respect to nucleotide base mutations in the viral sequences. Author summary: The currently emerging global disease COVID-19 caused by novel SARS-CoV-2 viruses requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus and its types. Investigations of the virus sequence are of special interest. Frequently, those are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative, if one focuses on interpretable models, i.e., those that do not act as black boxes. Doing so, we apply variants of Learning Vector Quantizers to analyze the SARS-CoV-2 sequences. 
We encoded the sequences and compared them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, efficient, and has a self-controlling mechanism regarding the applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of sequences. For sequences without virus type information (second data set), our trained model can predict the virus type. Thereby, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.
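The two ingredients described above, an alignment-free numerical encoding and a prototype classifier that may decline to answer, can be sketched as follows. This is not the authors' model: the k-mer encoding, the Euclidean distance, the fixed distance threshold used as the reject rule, and all names are assumptions for illustration, and the prototypes are set by hand rather than learned as in learning vector quantization.

```python
from collections import Counter
from itertools import product
from math import dist, inf


def kmer_profile(sequence, k=2, alphabet="ACGT"):
    """Relative frequencies of all k-mers over the alphabet (alignment-free)."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignores k-mers with other symbols
    return [counts[m] / total if total else 0.0 for m in kmers]


def classify_with_reject(profile, prototypes, max_distance):
    """Label of the nearest prototype, or None if it is too far away."""
    best_label, best_distance = None, inf
    for label, prototype in prototypes:
        d = dist(profile, prototype)
        if d < best_distance:
            best_label, best_distance = label, d
    # Reject option: refuse to decide when no prototype is close enough.
    return best_label if best_distance <= max_distance else None


# Hand-made prototypes from toy sequences (hypothetical data).
prototypes = [
    ("typeA", kmer_profile("AAAAAAAA")),
    ("typeB", kmer_profile("CGCGCGCG")),
]
near_a = classify_with_reject(kmer_profile("AAAAAAAT"), prototypes, 0.5)
unknown = classify_with_reject(kmer_profile("GGGGTTTT"), prototypes, 0.5)
```

Here `near_a` is assigned to `"typeA"`, while `unknown` is rejected (`None`) because its profile is far from both prototypes. Rejected sequences are the ones the abstract treats as candidates for new virus types.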



2021 ◽  
Author(s):  
Scott T. Small ◽  
John P. Wares

Abstract Knowledge of species ages and their distribution enhances our understanding of processes that create and maintain species diversity at both local and regional levels. The largest family of freshwater mussels (Unionidae) reaches its highest species diversity in drainages of the southeastern United States. By sequencing multiple loci from mussel species distributed throughout the drainages in this region, we attempt to uncover historical patterns of divergence, determine the role of vicariance events in species formation in mussels, and extend our hypothesis to freshwater animals in general. We analyzed 346 sequences from five genera encompassing 37 species. Species were sampled across 12 distinct drainages ending either in the Atlantic Ocean or the Gulf of Mexico. Overall, the different genera returned phylogenetic trees whose topologies were congruent with geographically contiguous drainages. The most common pattern was the grouping between the Atlantic slope and Gulf Coast drainages; however, the Tennessee drainage was often the exception to this pattern, grouping with the Atlantic slope. Most mussel species find a most recent common ancestor within a drainage before finding an ancestor between drainages. This supports the hypothesis of allopatric divergence followed by a later burst of speciation within a drainage. Our estimated divergence times for the Atlantic-Gulf split agree with other studies estimating vicariance in fish species of the Atlantic and Gulf coasts.



2016 ◽  
Author(s):  
Marguerite Lapierre ◽  
Amaury Lambert ◽  
Guillaume Achaz

Abstract Some methods for demographic inference based on the observed genetic diversity of current populations rely on the use of summary statistics such as the Site Frequency Spectrum (SFS). Demographic models can be either model-constrained with numerous parameters such as growth rates, timing of demographic events and migration rates, or model-flexible, with an unbounded collection of piecewise constant sizes. It is still debated whether demographic histories can be accurately inferred based on the SFS. Here we illustrate this theoretical issue on an example of demographic inference for an African population. The SFS of the Yoruba population (data from the 1000 Genomes Project) is fit to a simple model of population growth described with a single parameter (e.g., founding time). We infer a time to the most recent common ancestor of 1.7 million years for this population. However, we show that the Yoruba SFS is not informative enough to discriminate between several different models of growth. We also show that for such simple demographies, the fit of one-parameter models outperforms the model-flexible method recently developed by Liu and Fu. The use of this method on simulated data suggests that it is biased by the noise intrinsically present in the data.
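To make the summary statistic concrete, the sketch below computes an unfolded site frequency spectrum from a small haplotype matrix. It assumes each haplotype is a list of 0/1 values with 1 marking the derived allele. The function name and the toy data are illustrative and unrelated to the Yoruba data set.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i-1 counts sites whose derived allele occurs i times.

    haplotypes is a list of equal-length lists of 0/1 values, one per
    sampled chromosome, with 1 marking the derived allele.
    """
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):
        derived_count = sum(site)
        # Sites that are fixed or absent in the sample are not polymorphic.
        if 0 < derived_count < n:
            sfs[derived_count - 1] += 1
    return sfs


# Four sampled chromosomes, four sites (hypothetical data).
toy_haplotypes = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
toy_sfs = site_frequency_spectrum(toy_haplotypes)
```

In the toy data two sites carry the derived allele once, one site carries it three times, and the last site is monomorphic, so `toy_sfs` is `[2, 0, 1]`. Demographic models are then compared by how well their expected spectrum matches such counts.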



1983 ◽  
Vol 38 (1-2) ◽  
pp. 156-158 ◽  
Author(s):  
Geert De Soete

An iterative algorithm for constructing the optimal phylogenetic tree from a given set of dissimilarity data is described. The procedure is applied for illustrative purposes on a data set compiled by Fitch and Margoliash.



2019 ◽  
Vol 37 (4) ◽  
pp. 1202-1210 ◽  
Author(s):  
David A Duchêne ◽  
K Jun Tong ◽  
Charles S P Foster ◽  
Sebastián Duchêne ◽  
Robert Lanfear ◽  
...  

Abstract Evolution leaves heterogeneous patterns of nucleotide variation across the genome, with different loci subject to varying degrees of mutation, selection, and drift. In phylogenetics, the potential impacts of partitioning sequence data for the assignment of substitution models are well appreciated. In contrast, the treatment of branch lengths has received far less attention. In this study, we examined the effects of linking and unlinking branch-length parameters across loci or subsets of loci. By analyzing a range of empirical data sets, we find consistent support for a model in which branch lengths are proportionate between subsets of loci: gene trees share the same pattern of branch lengths, but form subsets that vary in their overall tree lengths. These models had substantially better statistical support than models that assume identical branch lengths across gene trees, or those in which genes form subsets with distinct branch-length patterns. We show using simulations and empirical data that the complexity of the branch-length model with the highest support depends on the length of the sequence alignment and on the numbers of taxa and loci in the data set. Our findings suggest that models in which branch lengths are proportionate between subsets have the highest statistical support under the conditions that are most commonly seen in practice. The results of our study have implications for model selection, computational efficiency, and experimental design in phylogenomics.



2003 ◽  
Vol 17 (4) ◽  
pp. 605 ◽  
Author(s):  
Philip S. Ward ◽  
Seán G. Brady

We investigated phylogenetic relationships among the 'primitive' Australian ant genera Myrmecia and Nothomyrmecia (stat. rev.) and the Baltic amber fossil genus Prionomyrmex, using a combination of morphological and molecular data. Outgroups for the analysis included representatives from a variety of potential sister-groups, including five extant subfamilies of ants and one extinct group (Sphecomyrminae). Parsimony analysis of the morphological data provides strong support (~95% bootstrap proportions) for the monophyly of (1) genus Myrmecia, (2) genus Prionomyrmex, and (3) a clade containing those two genera plus Nothomyrmecia. A group comprising Nothomyrmecia and Prionomyrmex is also upheld (85% bootstrap support). Molecular sequence data (~2200 base pairs from the 18S and 28S ribosomal RNA genes) corroborate these findings for extant taxa, with Myrmecia and Nothomyrmecia appearing as sister-groups with ~100% bootstrap support under parsimony, neighbour-joining and maximum-likelihood analyses. Neither the molecular nor the morphological data set allows us to identify unambiguously the sister-group of (Myrmecia + (Nothomyrmecia + Prionomyrmex)). Rather, Myrmecia and relatives are part of an unresolved polytomy that encompasses most of the ant subfamilies. Taken as a whole, our results support the contention that many of the major lineages of ants – including a clade that later came to contain Myrmecia, Nothomyrmecia and Prionomyrmex – arose at around the same time during a bout of diversification in the middle or late Cretaceous. On the basis of Bayesian dating analysis, the estimated age of the most recent common ancestor of Myrmecia and Nothomyrmecia is 74 million years (95% confidence limits, 53–101 million years), a result consistent with the origin of the myrmeciine stem lineage in the Cretaceous. The ant subfamily Myrmeciinae is redefined to contain two tribes, Myrmeciini (genus Myrmecia) and Prionomyrmecini (Nothomyrmecia and Prionomyrmex). 
Phylogenetic analysis of the enigmatic Argentine fossils Ameghinoia and Polanskiella demonstrates that they are also members of the Myrmeciinae, probably more closely related to Prionomyrmecini than to Myrmeciini. Thus, the myrmeciine ants appear to be a formerly widespread group that retained many ancestral formicid characteristics and that became extinct everywhere except in the Australian region.



2021 ◽  
Vol 7 (2) ◽  
pp. 179-187
Author(s):  
Sa'diatul Fuadiyah ◽  
Topik Hidayat ◽  
Didik Priyandoko

The student's ability to understand evolutionary studies is determined by representing a phylogenetic tree or cladogram. This study aims to determine the tree thinking ability, especially the students' reading ability in interpreting the cladogram. This descriptive study involved 29 students as subjects. Students were selected by purposive random sampling; only students who had attended and studied evolution courses were included. The data collection instruments were tests and interview guidelines. The test consisted of 20 multiple-choice questions with five answer choices. The difficulty levels of the questions included understanding, applying, analyzing, and evaluating. The phylogenetic tree interpretation refers to five indicators: the most recent common ancestor (MRCA), monophyletic group, branch proximity, contemporary descendant, and counting the branch or node positions. The data obtained were analyzed using Microsoft Excel 2013 and Anates-V4, then presented in percentage form. The results showed that many students misinterpreted the cladogram. Errors in cladogram interpretation occurred in the indicators for monophyletic group (38%), most recent common ancestor (59%), branch proximity (41%), contemporary ancestry (39%), and branch position calculations (53%). These results indicate that misreading in cladogram interpretation is moderate to high, so it is necessary to formulate the most appropriate way to teach phylogenetic studies in evolution.


