Full likelihood inference from the site frequency spectrum based on the optimal tree resolution

10.1101/181412 ◽

2017 ◽

Author(s):

Raazesh Sainudiin ◽

Amandine Véber

Keyword(s):

Frequency Spectrum ◽

Likelihood Function ◽

Simulation Studies ◽

Genealogical Tree ◽

Full Likelihood ◽

Branch Lengths ◽

Optimal Tree ◽

Site Frequency Spectrum ◽

Tree Resolution ◽

Balance Parameter

Abstract: We develop a novel importance sampler to compute the full likelihood function of a demographic or structural scenario given the site frequency spectrum (SFS) at a locus free of intra-locus recombination. This sampler, instead of representing the hidden genealogy of a sample of individuals by a labelled binary tree, uses the minimal level of information about such a tree that is needed for the likelihood of the SFS and thus takes advantage of the huge reduction in the size of the state space that needs to be integrated. We assume that the population may have demographically changed and may be non-panmictically structured, as reflected by the branch lengths and the topology of the genealogical tree of the sample, respectively. We also assume that mutations conform to the infinitely-many-sites model. We achieve this by a controlled Markov process that generates ‘particles’ in the hidden space of SFS histories which are always compatible with the observed SFS. To produce the particles, we use Aldous’ Beta-splitting model for a one parameter family of prior distributions over genealogical topologies or shapes (including that of the Kingman coalescent) and allow the branch lengths or epoch times to have a parametric family of priors specified by a model of demography (including exponential growth and bottleneck models). Assuming independence across unlinked loci, we can estimate the likelihood of a population scenario based on a large collection of independent SFS by an importance sampling scheme, using the (unconditional) distribution of the genealogies under this scenario when the latter is available. When it is not available, we instead compute the joint likelihood of the tree balance parameter β assuming that the tree topology follows Aldous’ Beta-splitting model, and of the demographic scenario determining the distribution of the inter-coalescence times or epoch times in the genealogy of a sample, in order to at least distinguish different equivalence classes of population scenarios leading to different tree balances and epoch times. Simulation studies are conducted to demonstrate the capabilities of the approach with publicly available code.
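
As a rough illustration of the one-parameter family of topology priors mentioned in this abstract, the sketch below computes the split-size probabilities of Aldous' Beta-splitting model for a clade of n leaves. It is not the authors' sampler or their published code. The function name `beta_split_probs` is hypothetical, and the weights are assumed to take the standard form q_n(i) ∝ Γ(β+i+1)Γ(β+n−i+1) / (Γ(i+1)Γ(n−i+1)) for i = 1, ..., n−1 and β > −2.

```python
import math


def beta_split_probs(n, beta):
    """Return [q_n(1), ..., q_n(n-1)]: the probability that a clade of n
    leaves splits into subclades of sizes i and n - i under the
    Beta-splitting model (illustrative helper, not from the paper)."""
    if n < 2:
        raise ValueError("a split needs at least two leaves")
    if beta <= -2:
        raise ValueError("the gamma-function form used here needs beta > -2")
    # Work in log space so that large n does not overflow the gamma function.
    log_weights = [
        math.lgamma(beta + i + 1) + math.lgamma(beta + n - i + 1)
        - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        for i in range(1, n)
    ]
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]
```

With β = 0 the weights are all equal, so every split size is equally likely; this is the Yule-type shape distribution that matches the Kingman coalescent. More negative values of β put more mass on unbalanced splits.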

Faculty Opinions recommendation of The site frequency spectrum for general coalescents.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.726151440.793522507 ◽

2016 ◽

Author(s):

Alon Keinan

Keyword(s):

Frequency Spectrum ◽

Site Frequency Spectrum

Non-parametric estimation of population size changes from the site frequency spectrum

10.1101/125351 ◽

2017 ◽

Author(s):

Berit Lindum Waltoft ◽

Asger Hobolth

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Weighted Average ◽

Cubic Spline ◽

Parametric Estimation ◽

New Method ◽

Eigenvalue Decomposition ◽

Human Populations ◽

Site Frequency Spectrum

Abstract: The variability in population size is a key quantity for understanding the evolutionary history of a species. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from the site frequency spectrum. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the variability in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied on data from nine different human populations.
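
For orientation only, the sketch below computes the simplest baseline against which such methods are usually read: the expected normalized SFS for a constant-size panmictic population under the standard coalescent with infinitely-many-sites mutation, where the expected number of sites with derived-allele count i is proportional to 1/i. This is not the CubSFS method and does not reproduce the eigenvalue-based derivation described in the abstract; the function name `expected_sfs_constant_size` is hypothetical.

```python
from fractions import Fraction


def expected_sfs_constant_size(n):
    """Expected proportion of segregating sites with derived-allele count
    i = 1, ..., n - 1 in a sample of n sequences, assuming a constant
    population size (illustrative baseline, not CubSFS)."""
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    # Under constant size the expected count in class i is proportional to 1/i.
    weights = [Fraction(1, i) for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]
```

For a sample of four sequences this gives proportions 6/11, 3/11 and 2/11 for singletons, doubletons and tripletons. Departures of an observed SFS from this shape are the signal that size-change estimators exploit.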

The divergence history of European blue mussel species reconstructed from Approximate Bayesian Computation: the effects of sequencing techniques and sampling strategies

10.1101/259135 ◽

2018 ◽

Author(s):

Christelle Fraïsse ◽

Camille Roux ◽

Pierre-Alexandre Gagnaire ◽

Jonathan Romiguier ◽

Nicolas Faivre ◽

...

Keyword(s):

Frequency Spectrum ◽

Approximate Bayesian Computation ◽

Exome Capture ◽

Bayesian Computation ◽

Sampling Strategies ◽

Migration Rates ◽

History Of ◽

Site Frequency Spectrum ◽

Approximate Bayesian ◽

Number Of Individuals

Abstract: Genome-scale diversity data are increasingly available in a variety of biological systems, and can be used to reconstruct the past evolutionary history of species divergence. However, extracting the full demographic information from these data is not trivial, and requires inferential methods that account for the diversity of coalescent histories throughout the genome. Here, we evaluate the potential and limitations of one such approach. We reexamine a well-known system of mussel sister species, using the joint site frequency spectrum (jSFS) of synonymous mutations computed either from exome capture or RNA-seq, in an Approximate Bayesian Computation (ABC) framework. We first assess the best sampling strategy (number of individuals, loci, and bins in the jSFS), and show that model selection is robust to variation in the number of individuals and loci. In contrast, different binning choices when summarizing the joint site frequency spectrum strongly affect the results: including classes of low and high frequency shared polymorphisms can more effectively reveal recent migration events. We then take advantage of the flexibility of ABC to compare more realistic models of speciation, including variation in migration rates through time (i.e. periodic connectivity) and across genes (i.e. genome-wide heterogeneity in migration rates). We show that these models were consistently selected as the most probable, suggesting that mussels have experienced a complex history of gene flow during divergence and that the species boundary is semi-permeable. Our work provides a comprehensive evaluation of ABC demographic inference in mussels based on the coding site frequency spectrum, and supplies guidelines for employing different sequencing techniques and sampling strategies. We emphasize, perhaps surprisingly, that inferences are less limited by the volume of data than by the way in which they are analyzed.

Contemporary Demographic Reconstruction Methods Are Robust to Genome Assembly Quality: A Case Study in Tasmanian Devils

Molecular Biology and Evolution ◽

10.1093/molbev/msz191 ◽

2019 ◽

Vol 36 (12) ◽

pp. 2906-2921 ◽

Cited By ~ 20

Author(s):

Austin H Patton ◽

Mark J Margres ◽

Amanda R Stahlke ◽

Sarah Hendricks ◽

Kevin Lewallen ◽

...

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Genome Assembly ◽

Genomic Sequence ◽

Demographic History ◽

Tasmanian Devil ◽

Assembly Quality ◽

Reconstruction Methods ◽

Site Frequency Spectrum ◽

Genome Assemblies

Abstract: Reconstructing species’ demographic histories is a central focus of molecular ecology and evolution. Recently, an expanding suite of methods leveraging either the sequentially Markovian coalescent (SMC) or the site-frequency spectrum has been developed to reconstruct population size histories from genomic sequence data. However, few studies have investigated the robustness of these methods to genome assemblies of varying quality. In this study, we first present an improved genome assembly for the Tasmanian devil using the Chicago library method. Compared with the original reference genome, our new assembly reduces the number of scaffolds (from 35,975 to 10,010) and increases the scaffold N90 (from 0.101 to 2.164 Mb). Second, we assess the performance of four contemporary genomic methods for inferring population size history (PSMC, MSMC, SMC++, Stairway Plot), using the two devil genome assemblies as well as simulated, artificially fragmented genomes that approximate the hypothesized demographic history of Tasmanian devils. We demonstrate that each method is robust to assembly quality, producing similar estimates of Ne when simulated genomes were fragmented into up to 5,000 scaffolds. Overall, methods reliant on the SMC are most reliable between ∼300 generations before present (gbp) and 100 kgbp, whereas methods exclusively reliant on the site-frequency spectrum are most reliable between the present and 30 gbp. Our results suggest that when used in concert, genomic methods for reconstructing species’ effective population size histories 1) can be applied to nonmodel organisms without highly contiguous reference genomes, and 2) are capable of detecting independently documented effects of historical geological events.
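
The scaffold N90 quoted in this abstract is a contiguity statistic: the largest length L such that scaffolds of length L or longer together cover at least 90% of the assembly. A minimal sketch of that calculation follows, assuming the usual definition; the function name `scaffold_nx` and the toy lengths are illustrative and are not taken from the study.

```python
def scaffold_nx(lengths, x=90):
    """Return the Nx statistic: the length of the scaffold at which the
    running total, taken from longest to shortest, first reaches x% of
    the summed scaffold length (illustrative helper)."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


# Toy example: five scaffolds totalling 30 units.
print(scaffold_nx([10, 8, 5, 4, 3], 90))  # 4
print(scaffold_nx([10, 8, 5, 4, 3], 50))  # 8
```

A higher N90 with fewer scaffolds, as reported for the new assembly, means that most of the genome sits in longer contiguous pieces.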

Computational Inference Beyond Kingman's Coalescent

Journal of Applied Probability ◽

10.1017/s0021900200012614 ◽

2015 ◽

Vol 52 (02) ◽

pp. 519-537 ◽

Cited By ~ 10

Author(s):

Jere Koskela ◽

Paul Jenkins ◽

Dario Spanò

Keyword(s):

Importance Sampling ◽

Genetic Data ◽

Likelihood Inference ◽

Data Sets ◽

Conditional Sampling ◽

Challenging Problem ◽

Computational Inference ◽

Sampling Distributions ◽

Full Likelihood ◽

Kingman’s Coalescent

Full likelihood inference under Kingman's coalescent is a computationally challenging problem to which importance sampling (IS) and the product of approximate conditionals (PAC) methods have been applied successfully. Both methods can be expressed in terms of families of intractable conditional sampling distributions (CSDs), and rely on principled approximations for accurate inference. Recently, more general Λ- and Ξ-coalescents have been observed to provide better modelling fits to some genetic data sets. We derive families of approximate CSDs for finite sites Λ- and Ξ-coalescents, and use them to obtain ‘approximately optimal’ IS and PAC algorithms for Λ-coalescents, yielding substantial gains in efficiency over existing methods.
