Bayesian Inference of Joint Coalescence Times for Sampled Sequences

Branch Lengths ◽

Site Frequency Spectrum ◽

The site frequency spectrum (SFS) is a commonly used statistic to summarize genetic variation in a sample of genomic sequences from a population. Such a genomic sample is associated with an imputed genealogical history with attributes such as branch lengths, coalescence times and the time to the most recent common ancestor (TMRCA) as well as topological and combinatorial properties. We present a Bayesian model for sampling from the joint posterior distribution of coalescence times conditional on the SFS associated with a sample of sequences in the absence of selection. In this model, the combinatorial properties of a genealogy, which is represented as a coalescent tree, are expressed as matrices. This facilitates the calculation of likelihoods and the effective sampling of the entire space of tree structures according to the Equal Rates Markov (or Yule-type) measure. Unlike previous methods, assumptions as to the type of stochastic process that generated the genealogical tree are not required. Novel approaches to defining both uninformative and informative prior distributions are employed. The uncertainty in inference due to the stochastic nature of mutation and the unknown tree structure is expressed by the shape of the posterior distributions. The method is implemented using the general purpose Markov Chain Monte Carlo software PyMC3. From the sampled posterior distribution of coalescence times, one can also infer related quantities such as the number of ancestors of a sample at a given time in the past (ancestral distribution) and the probability of specific relationships between branch lengths (for example, that the most recent branch is longer than all the others). The performance of the method is evaluated against simulated data and is also applied to historic mitochondrial data from the Nuu-Chah-Nulth people of North America. The method can be used to obtain estimates of the TMRCA of the sample. 
The relationship of these estimates to those given by "Thomson's estimator" is explored.

Keywords: coalescent theory; Bayesian inference; time to most recent common ancestor; site frequency spectrum
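To make the two quantities named in this abstract concrete, the sketch below computes an unfolded site frequency spectrum from a 0/1 haplotype matrix and then applies Thomson's estimator in the form it is commonly stated: the total count of derived mutations carried by the sample, divided by n times the mutation rate. The function names, the matrix encoding (rows are sequences, 1 marks the derived allele) and the per-sequence mutation rate are assumptions made for illustration; this is not the Bayesian model the paper describes.

```python
import numpy as np

def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry k-1 counts sites whose derived allele is carried by k of n sequences."""
    haplotypes = np.asarray(haplotypes, dtype=int)
    n = haplotypes.shape[0]
    derived_counts = haplotypes.sum(axis=0)           # derived-allele count per site
    counts = np.bincount(derived_counts, minlength=n + 1)
    return counts[1:n]                                # drop monomorphic classes 0 and n

def thomson_tmrca(sfs, mutation_rate):
    """Thomson-style TMRCA estimate: sum_k k * xi_k / (n * mu).

    mutation_rate is assumed to be per sequence per unit time.
    """
    sfs = np.asarray(sfs)
    n = len(sfs) + 1
    k = np.arange(1, n)
    return float((k * sfs).sum() / (n * mutation_rate))

# Toy sample: 4 sequences, 5 sites (hypothetical data)
toy = [[1, 0, 0, 1, 0],
       [1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 1, 1, 0, 0]]
toy_sfs = site_frequency_spectrum(toy)
print(toy_sfs.tolist())              # [2, 2, 0]
print(thomson_tmrca(toy_sfs, 0.5))   # 3.0
```

The estimator uses only the SFS, which is why it can serve as a point of comparison for a posterior on the TMRCA that is also conditioned on the SFS.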

Genetic variability under the seedbank coalescent

10.1101/017244 ◽

2015 ◽

Author(s):

Jochen Blath ◽

Bjarki Eldon ◽

Adrian Casanova ◽

Noemi Kurt ◽

Maite Wilke-Berenguer

Keyword(s):

Genetic Variability ◽

Frequency Spectrum ◽

Common Ancestor ◽

Scaling Limit ◽

Natural Populations ◽

Recent Common Ancestor ◽

Active Population ◽

Large Populations ◽

Site Frequency Spectrum

We analyse patterns of genetic variability of populations in the presence of a large seedbank with the help of a new coalescent structure called the seedbank coalescent. This ancestral process appears naturally as the scaling limit of the genealogy of large populations that sustain seedbanks, if the seedbank size and individual dormancy times are of the same order as the active population. Mutations appear as Poisson processes on the active lineages, and potentially at a reduced rate also on the dormant lineages. The presence of 'dormant' lineages leads to qualitatively altered times to the most recent common ancestor and non-classical patterns of genetic diversity. To illustrate this we provide a Wright-Fisher model with a seedbank component and mutation, motivated by recent models of microbial dormancy, whose genealogy can be described by the seedbank coalescent. Based on our coalescent model, we derive recursions for the expectation and variance of the time to most recent common ancestor, number of segregating sites, pairwise differences, and singletons. Estimates (obtained by simulations) of the distributions of commonly employed distance statistics, in the presence and absence of a seedbank, are compared. The effect of a seedbank on the expected site-frequency spectrum is also investigated using simulations. Our results indicate that the presence of a large seedbank considerably alters the distribution of some distance statistics, as well as the site-frequency spectrum. Thus, one should be able to detect from genetic data the presence of a large seedbank in natural populations.
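For orientation, the no-seedbank baseline against which such statistics are compared has well-known closed forms under the standard Kingman coalescent. The sketch below evaluates them; it does not implement the seedbank recursions of the paper. It assumes the usual convention in which each pair of lineages coalesces at rate 1 and mutations fall on each lineage at rate theta/2; the function name and return keys are illustrative.

```python
def kingman_expectations(n, theta):
    """Classical expectations for a sample of size n under the Kingman coalescent.

    Assumed convention: pairwise coalescence rate 1, mutation rate theta/2 per lineage.
    """
    harmonic = sum(1.0 / k for k in range(1, n))
    return {
        "tmrca": 2.0 * (1.0 - 1.0 / n),                  # E[TMRCA]
        "segregating_sites": theta * harmonic,           # Watterson: theta * sum_{k<n} 1/k
        "pairwise_differences": theta,                   # E[mean pairwise differences]
        "sfs": [theta / k for k in range(1, n)],         # E[xi_k] = theta / k
    }

baseline = kingman_expectations(n=4, theta=2.0)
print(baseline["tmrca"])      # 1.5
print(baseline["sfs"][0])     # expected singletons: 2.0
```

Departures of observed or simulated statistics from these values are the kind of signal the abstract suggests could reveal a large seedbank.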

Allelic Genealogies in Sporophytic Self-Incompatibility Systems in Plants

Genetics ◽

10.1093/genetics/150.3.1187 ◽

1998 ◽

Vol 150 (3) ◽

pp. 1187-1198 ◽

Cited By ~ 9

Author(s):

Mikkel H Schierup ◽

Xavier Vekemans ◽

Freddy B Christiansen

Keyword(s):

Dominance Hierarchy ◽

Common Ancestor ◽

Recent Common Ancestor ◽

Self Incompatibility ◽

Dominance Interactions ◽

Turnover Process ◽

Neutral Gene ◽

Interspecific Comparisons ◽

Expectations for the time scale and structure of allelic genealogies in finite populations are formed under three models of sporophytic self-incompatibility. The models differ in the dominance interactions among the alleles that determine the self-incompatibility phenotype: in the SSIcod model, alleles act codominantly in both pollen and style; in the SSIdom model, alleles form a dominance hierarchy; and in the SSIdomcod model, alleles are codominant in the style and show a dominance hierarchy in the pollen. Coalescence times of alleles rarely differ more than threefold from those under gametophytic self-incompatibility, and transspecific polymorphism is therefore expected to be equally common. The previously reported directional turnover process of alleles in the SSIdomcod model results in coalescence times lower and substitution rates higher than those in the other models. The SSIdom model assumes strong asymmetries in allelic action, and the most recessive extant allele is likely to be the most recent common ancestor. Despite these asymmetries, the expected shape of the allele genealogies does not deviate markedly from the shape of a neutral gene genealogy. The application of the results to sequence surveys of alleles, including interspecific comparisons, is discussed.

Accuracy of demographic inferences from site frequency spectrum: the case of the yoruba population

10.1101/078618 ◽

2016 ◽

Author(s):

Marguerite Lapierre ◽

Amaury Lambert ◽

Guillaume Achaz

Keyword(s):

Frequency Spectrum ◽

Simulated Data ◽

Population Data ◽

Recent Common Ancestor ◽

Theoretical Issue ◽

Demographic Models ◽

Site Frequency Spectrum ◽

Demographic Inference ◽

And Migration

Some methods for demographic inference based on the observed genetic diversity of current populations rely on the use of summary statistics such as the Site Frequency Spectrum (SFS). Demographic models can be either model-constrained with numerous parameters such as growth rates, timing of demographic events and migration rates, or model-flexible, with an unbounded collection of piecewise constant sizes. It is still debated whether demographic histories can be accurately inferred based on the SFS. Here we illustrate this theoretical issue on an example of demographic inference for an African population. The SFS of the Yoruba population (data from the 1000 Genomes Project) is fit to a simple model of population growth described with a single parameter (e.g., founding time). We infer a time to the most recent common ancestor of 1.7 million years for this population. However, we show that the Yoruba SFS is not informative enough to discriminate between several different models of growth. We also show that for such simple demographies, the fit of one-parameter models outperforms the model-flexible method recently developed by Liu and Fu. The use of this method on simulated data suggests that it is biased by the noise intrinsically present in the data.

Coalescence Times for the Bienaymé-Galton-Watson Process

Journal of Applied Probability ◽

10.1017/s0021900200010184 ◽

2014 ◽

Vol 51 (01) ◽

pp. 209-218

Author(s):

V. Le

Keyword(s):

Continuous Time ◽

Joint Distribution ◽

Common Ancestor ◽

Recent Common Ancestor ◽

Subcritical Case ◽

Current Generation ◽

Watson Process ◽

Number Of Individuals ◽

We investigate the distribution of the coalescence time (most recent common ancestor) for two individuals picked at random (uniformly) in the current generation of a continuous-time Bienaymé-Galton-Watson process founded t units of time ago. We also obtain limiting distributions as t → ∞ in the subcritical case. We extend our results for two individuals to the joint distribution of coalescence times for any finite number of individuals sampled in the current generation.

Coalescence Times for the Bienaymé-Galton-Watson Process

Journal of Applied Probability ◽

10.1239/jap/1395771424 ◽

2014 ◽

Vol 51 (1) ◽

pp. 209-218 ◽

Cited By ~ 6

Author(s):

V. Le

Keyword(s):

Continuous Time ◽

Joint Distribution ◽

Common Ancestor ◽

Recent Common Ancestor ◽

Subcritical Case ◽

Current Generation ◽

Watson Process ◽

Number Of Individuals ◽

We investigate the distribution of the coalescence time (most recent common ancestor) for two individuals picked at random (uniformly) in the current generation of a continuous-time Bienaymé-Galton-Watson process founded t units of time ago. We also obtain limiting distributions as t → ∞ in the subcritical case. We extend our results for two individuals to the joint distribution of coalescence times for any finite number of individuals sampled in the current generation.

The probability of monophyly of a sample of gene lineages on a species tree

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1601074113 ◽

2016 ◽

Vol 113 (29) ◽

pp. 8002-8009 ◽

Cited By ~ 11

Author(s):

Rohan S. Mehta ◽

David Bryant ◽

Noah A. Rosenberg

Keyword(s):

Species Delimitation ◽

Common Ancestor ◽

Species Group ◽

Species Tree ◽

Recent Common Ancestor ◽

Evolutionary Models ◽

Tree Model ◽

Branch Lengths ◽

Multiple Species

Monophyletic groups—groups that consist of all of the descendants of a most recent common ancestor—arise naturally as a consequence of descent processes that result in meaningful distinctions between organisms. Aspects of monophyly are therefore central to fields that examine and use genealogical descent. In particular, studies in conservation genetics, phylogeography, population genetics, species delimitation, and systematics can all make use of mathematical predictions under evolutionary models about features of monophyly. One important calculation, the probability that a set of gene lineages is monophyletic under a two-species neutral coalescent model, has been used in many studies. Here, we extend this calculation for a species tree model that contains arbitrarily many species. We study the effects of species tree topology and branch lengths on the monophyly probability. These analyses reveal new behavior, including the maintenance of nontrivial monophyly probabilities for gene lineage samples that span multiple species and even for lineages that do not derive from a monophyletic species group. We illustrate the mathematical results using an example application to data from maize and teosinte.

On the Use of Star-Shaped Genealogies in Inference of Coalescence Times

Genetics ◽

10.1093/genetics/164.4.1677 ◽

2003 ◽

Vol 164 (4) ◽

pp. 1677-1682 ◽

Cited By ~ 1

Author(s):

Noah A Rosenberg ◽

Aaron E Hirsh

Keyword(s):

Growth Rate ◽

Population Size ◽

Exponential Growth ◽

Common Ancestor ◽

Pairwise Comparison ◽

Recent Common Ancestor ◽

Large Populations ◽

Tree Length ◽

Genealogies from rapidly growing populations have approximate "star" shapes. We study the degree to which this approximation holds in the context of estimating the time to the most recent common ancestor (TMRCA) of a set of lineages. In an exponential growth scenario, we find that unless the product of population size (N) and growth rate (r) is at least ∼10^5, the "pairwise comparison estimator" of TMRCA that derives from the star genealogy assumption has a bias of 10-50%. Thus, the estimator is appropriate only for large populations that have grown very rapidly. The "tree-length estimator" of TMRCA is more biased than the pairwise comparison estimator, having low bias only for extremely large values of Nr.
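The logic behind the two estimators can be sketched under the star-genealogy assumption itself: if every lineage traces back independently to a common ancestor T time units ago, any two sequences are separated by 2T of branch length and the whole tree has length nT. With a mutation rate mu per sequence per unit time, that gives T ≈ (mean pairwise differences) / (2 mu) and T ≈ (number of segregating sites) / (n mu). The forms below follow from that reasoning and are an assumption about how the estimators are defined, not a transcription from the paper; the function names and toy sequences are hypothetical.

```python
from itertools import combinations

def pairwise_tmrca(sequences, mu):
    """Star-genealogy estimate from mean pairwise differences: pi / (2 * mu)."""
    pairs = list(combinations(sequences, 2))
    total_diffs = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total_diffs / len(pairs) / (2.0 * mu)

def tree_length_tmrca(sequences, mu):
    """Star-genealogy estimate from segregating sites: S / (n * mu)."""
    n = len(sequences)
    segregating = sum(len(set(column)) > 1 for column in zip(*sequences))
    return segregating / (n * mu)

# Toy aligned sequences in which each mutation is private to one lineage
toy = ["AAAA", "AAAT", "AATA", "ATAA"]
print(pairwise_tmrca(toy, mu=0.5))      # 1.5
print(tree_length_tmrca(toy, mu=0.5))   # 1.5
```

On a sample whose mutations are all private to single lineages, as in the toy data, the two estimates agree; internal branches, which a star genealogy ignores, are what drive them apart and introduce the bias discussed in the abstract.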

Molecular epidemiological characteristics of echovirus 6 in mainland China: extensive circulation of genotype F from 2007 to 2018

Archives of Virology ◽

10.1007/s00705-020-04934-7 ◽

2021 ◽

Author(s):

Wenjun Cheng ◽

Tianjiao Ji ◽

Shuaifeng Zhou ◽

Yong Shi ◽

Lili Jiang ◽

...

Keyword(s):

Common Ancestor ◽

Mainland China ◽

Recent Common Ancestor ◽

Genetic Characteristics ◽

Significant Difference ◽

Echovirus 6 ◽

Epidemiological Characteristics ◽

Highest Posterior Density ◽

Genotype F

Echovirus 6 (E6) is associated with various clinical diseases and is frequently detected in environmental sewage. Despite its high prevalence in humans and the environment, little is known about its molecular phylogeography in mainland China. In this study, 114 of 21,539 (0.53%) clinical specimens from hand, foot, and mouth disease (HFMD) cases collected between 2007 and 2018 were positive for E6. The complete VP1 sequences of 87 representative E6 strains, including 24 strains from this study, were used to investigate the evolutionary genetic characteristics and geographical spread of E6 strains. Phylogenetic analysis based on VP1 nucleotide sequence divergence showed that, globally, E6 strains can be grouped into six genotypes, designated A to F. Chinese E6 strains collected between 1988 and 2018 were found to belong to genotypes C, E, and F, with genotype F being predominant from 2007 to 2018. There was no significant difference in the geographical distribution of each genotype. The evolutionary rate of E6 was estimated to be 3.631 × 10⁻³ substitutions site⁻¹ year⁻¹ (95% highest posterior density [HPD]: 3.2406 × 10⁻³ to 4.031 × 10⁻³ substitutions site⁻¹ year⁻¹) by Bayesian MCMC analysis. The most recent common ancestor of the E6 genotypes was traced back to 1863, whereas their common ancestor in China was traced back to around 1962. A small genetic shift was detected in the Chinese E6 population size in 2009 according to Bayesian skyline analysis, which indicated that there might have been an epidemic around that year.

The Ancestry of a Sample of Sequences Subject to Recombination

Genetics ◽

10.1093/genetics/151.3.1217 ◽

1999 ◽

Vol 151 (3) ◽

pp. 1217-1228 ◽

Cited By ~ 1

Author(s):

Carsten Wiuf ◽

Jotun Hein

Keyword(s):

Genetic Distance ◽

Population Size ◽

Upper Bound ◽

Recombination Rate ◽

Common Ancestor ◽

Recent Common Ancestor ◽

Sequence Length ◽

The Mean ◽

Mean Time

In this article we discuss the ancestry of sequences sampled from the coalescent with recombination with constant population size 2N. We have studied a number of variables based on simulations of sample histories, and some analytical results are derived. Consider the leftmost nucleotide in the sequences. We show that the number of nucleotides sharing a most recent common ancestor (MRCA) with the leftmost nucleotide is ≈ log(1 + 4NLr)/(4Nr) when two sequences are compared, where L denotes sequence length in nucleotides, and r the recombination rate between any two neighboring nucleotides per generation. For larger samples, the number of nucleotides sharing an MRCA with the leftmost nucleotide decreases and becomes almost independent of 4NLr. Further, we show that a segment of the sequences sharing an MRCA consists on average of 3/(8Nr) nucleotides when two sequences are compared, and that this decreases toward 1/(4Nr) nucleotides when the whole population is sampled. A measure of the correlation between the genealogies of two nucleotides on two sequences is introduced. We show analytically that even when the nucleotides are separated by a large genetic distance, but share an MRCA, the genealogies will show only little correlation. This is surprising, because the time until the two nucleotides shared an MRCA is reciprocal to the genetic distance. In simulations, the mean time until all positions in the sample have found an MRCA increases logarithmically with increasing sequence length and is considerably lower than a theoretically predicted upper bound. On the basis of simulations, it turns out that important properties of the coalescent with recombination of the whole population are reflected in the properties of a sample of low size.
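A quick numerical reading of the approximations quoted in this abstract can help build intuition. The sketch below evaluates them with the denominators grouped as (4Nr) and (8Nr), which is the dimensionally consistent reading since the results are counts of nucleotides. The function names and the parameter values (N = 10,000, r = 10⁻⁸, L = 100,000) are illustrative assumptions, not figures from the paper.

```python
import math

def shared_mrca_nucleotides(N, r, L):
    """Approximate number of nucleotides sharing an MRCA with the leftmost one (two sequences)."""
    return math.log(1.0 + 4.0 * N * L * r) / (4.0 * N * r)

def mean_segment_length(N, r, whole_population=False):
    """Mean length of a segment sharing an MRCA: 3/(8Nr) for two sequences, 1/(4Nr) in the limit."""
    return 1.0 / (4.0 * N * r) if whole_population else 3.0 / (8.0 * N * r)

# Hypothetical parameter values chosen only to show orders of magnitude
N, r, L = 10_000, 1e-8, 100_000
print(round(shared_mrca_nucleotides(N, r, L)))                  # roughly 9,000 of the 100,000 sites
print(round(mean_segment_length(N, r)))                         # 3750
print(round(mean_segment_length(N, r, whole_population=True)))  # 2500
```

Because the first expression grows only logarithmically in L, lengthening the sequence adds few further sites that share their MRCA with the leftmost nucleotide.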

The evolutionary history and conservation value of disjunct Bartonia paniculata subsp. paniculata (Branched Bartonia) populations in Canada

Botany ◽

10.1139/cjb-2013-0063 ◽

2013 ◽

Vol 91 (9) ◽

pp. 605-613 ◽

Cited By ~ 6

Author(s):

Claudia Ciotir ◽

Chris Yesson ◽

Joanna Freeland

Keyword(s):

Great Lakes ◽

Evolutionary History ◽

Common Ancestor ◽

Recent Common Ancestor ◽

Great Lakes Region ◽

Conservation Value ◽

Disjunct Populations ◽

Management Plans ◽

Global Status

Understanding the spatial distribution of genetic diversity and its evolutionary history is an essential part of developing effective biodiversity management plans. This may be particularly true when considering the value of peripheral or disjunct populations. Although conservation decisions are often made with reference to geopolitical boundaries, many policy-makers also consider global distributions, and therefore a species’ global status may temper its regional status. Many disjunct populations can be found in the Great Lakes region of North America, including those of Bartonia paniculata subsp. paniculata, a species that has been designated as threatened in Canada but globally secure. We compared chloroplast sequences between disjunct (Canada) and core (USA) populations of B. paniculata subsp. paniculata separated by 600 km, which is the minimum distance between disjunct and core populations in this subspecies. We found that although lineages within the disjunct populations shared a relatively recent common ancestor, the genetic divergence between plants from Ontario and New Jersey was substantially greater than expected for a consubspecific comparison. A coalescence-based analysis dated the most recent common ancestor of the Canadian and US populations at approximately 534 000 years ago with the lower confidence estimate at 226 000 years ago. This substantially predates the Last Glacial Maximum and suggests that disjunct and core populations have followed independent evolutionary trajectories throughout multiple glacial–interglacial cycles. Our findings provide important insight into the diverse processes that have resulted in numerous disjunct species in the Great Lakes region and highlight a need for additional work on Canadian B. paniculata subsp. paniculata taxonomy prior to a reevaluation of its conservation value.