Full Likelihood Inference from the Site Frequency Spectrum based on the Optimal Tree Resolution

2017 ◽  
Author(s):  
Raazesh Sainudiin ◽  
Amandine Véber

Abstract We develop a novel importance sampler to compute the full likelihood function of a demographic or structural scenario given the site frequency spectrum (SFS) at a locus free of intra-locus recombination. Instead of representing the hidden genealogy of a sample of individuals by a labelled binary tree, this sampler uses the minimal level of information about such a tree that is needed for the likelihood of the SFS, and thus takes advantage of the huge reduction in the size of the state space that needs to be integrated. We assume that the population may have demographically changed and may be non-panmictically structured, as reflected by the branch lengths and the topology of the genealogical tree of the sample, respectively. We also assume that mutations conform to the infinitely-many-sites model. We achieve this by a controlled Markov process that generates ‘particles’ in the hidden space of SFS histories which are always compatible with the observed SFS. To produce the particles, we use Aldous’ Beta-splitting model for a one-parameter family of prior distributions over genealogical topologies or shapes (including that of the Kingman coalescent) and allow the branch lengths or epoch times to have a parametric family of priors specified by a model of demography (including exponential growth and bottleneck models). Assuming independence across unlinked loci, we can estimate the likelihood of a population scenario based on a large collection of independent SFS by an importance sampling scheme, using the (unconditional) distribution of the genealogies under this scenario when the latter is available. When it is not available, we instead compute the joint likelihood of the tree balance parameter β, assuming that the tree topology follows Aldous’ Beta-splitting model, and of the demographic scenario determining the distribution of the inter-coalescence times or epoch times in the genealogy of a sample, in order to at least distinguish different equivalence classes of population scenarios leading to different tree balances and epoch times. Simulation studies are conducted to demonstrate the capabilities of the approach with publicly available code.
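For readers unfamiliar with the statistic itself, the following is a minimal sketch of how an unfolded SFS can be tabulated from aligned sequences. It is not the authors' sampler. The function name `site_frequency_spectrum` and the 0/1 encoding (1 marks the derived allele, so ancestral states are assumed known) are illustrative assumptions.

```python
from typing import List, Sequence


def site_frequency_spectrum(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Return the unfolded SFS [xi_1, ..., xi_{n-1}] of a 0/1 haplotype matrix.

    Rows are sampled sequences and columns are sites; 1 marks the derived
    allele (an assumption of this sketch). xi_i counts the sites at which
    exactly i of the n sequences carry the derived allele.
    """
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sequences")
    n_sites = len(haplotypes[0])
    if any(len(row) != n_sites for row in haplotypes):
        raise ValueError("all sequences must have the same number of sites")

    sfs = [0] * (n - 1)
    for column in zip(*haplotypes):  # iterate over sites
        derived = sum(column)
        if 0 < derived < n:  # skip sites that are not polymorphic
            sfs[derived - 1] += 1
    return sfs


# Toy sample of four sequences at four sites, consistent with one tree.
sample = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 1, 0, 1],
]
toy_sfs = site_frequency_spectrum(sample)
```

In this toy sample two sites are singletons and two are doubletons, so the spectrum has entries for classes 1 and 2 only.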

2021 ◽  
Author(s):  
Helmut E Simon ◽  
Gavin A Huttley

The site frequency spectrum (SFS) is a commonly used statistic to summarize genetic variation in a sample of genomic sequences from a population. Such a genomic sample is associated with an imputed genealogical history with attributes such as branch lengths, coalescence times and the time to the most recent common ancestor (TMRCA) as well as topological and combinatorial properties. We present a Bayesian model for sampling from the joint posterior distribution of coalescence times conditional on the SFS associated with a sample of sequences in the absence of selection. In this model, the combinatorial properties of a genealogy, which is represented as a coalescent tree, are expressed as matrices. This facilitates the calculation of likelihoods and the effective sampling of the entire space of tree structures according to the Equal Rates Markov (or Yule-type) measure. Unlike previous methods, assumptions as to the type of stochastic process that generated the genealogical tree are not required. Novel approaches to defining both uninformative and informative prior distributions are employed. The uncertainty in inference due to the stochastic nature of mutation and the unknown tree structure is expressed by the shape of the posterior distributions. The method is implemented using the general purpose Markov Chain Monte Carlo software PyMC3. From the sampled posterior distribution of coalescence times, one can also infer related quantities such as the number of ancestors of a sample at a given time in the past (ancestral distribution) and the probability of specific relationships between branch lengths (for example, that the most recent branch is longer than all the others). The performance of the method is evaluated against simulated data and is also applied to historic mitochondrial data from the Nuu-Chah-Nulth people of North America. The method can be used to obtain estimates of the TMRCA of the sample. The relationship of these estimates to those given by "Thomson's estimator" is explored.
Keywords: coalescent theory; Bayesian inference; time to most recent common ancestor; site frequency spectrum
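As a point of comparison for the TMRCA estimates mentioned in this abstract, here is a minimal sketch of Thomson's estimator written in terms of the unfolded SFS. It assumes the infinitely-many-sites model, known ancestral states, and a known mutation rate for the whole locus per unit time. The function name and argument names are illustrative and not taken from the paper.

```python
from typing import Sequence


def thomson_tmrca(sfs: Sequence[int], mutation_rate: float) -> float:
    """Thomson's estimator of the TMRCA from an unfolded SFS.

    sfs[i - 1] is the number of sites whose derived allele is carried by
    exactly i of the n sampled sequences, so len(sfs) == n - 1.
    mutation_rate is the rate for the whole locus per unit time; the
    estimate is returned in the same time unit.

    Each mutation of class i separates i sequences from the root, so the
    total number of sequence-to-root differences is sum(i * xi_i); dividing
    by n gives the mean, and dividing by the rate converts it to time.
    """
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    n = len(sfs) + 1
    total_differences = sum(i * count for i, count in enumerate(sfs, start=1))
    return total_differences / (n * mutation_rate)


# Four sequences with two singletons and two doubletons, and a
# hypothetical locus-wide rate of 0.5 mutations per time unit.
estimate = thomson_tmrca([2, 2, 0], mutation_rate=0.5)
```

The estimator uses only the mean sequence-to-root distance, which is why it can be computed from the SFS alone without reconstructing a tree.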


2017 ◽  
Author(s):  
Berit Lindum Waltoft ◽  
Asger Hobolth

Abstract The variability in population size is a key quantity for understanding the evolutionary history of a species. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from the site frequency spectrum. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the variability in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to data from nine different human populations.
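The weighted-average objective described in this abstract can be illustrated with a small sketch. The abstract does not give the exact form of either term, so the chi-square-like fit and the squared-second-difference penalty below are stand-ins chosen for illustration, and every name (`goodness_of_fit`, `roughness`, `penalized_score`) is hypothetical. The expected SFS is taken as an input rather than derived from the population-size curve.

```python
from typing import Sequence


def goodness_of_fit(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Chi-square-like distance between observed and expected SFS (a stand-in)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def roughness(values: Sequence[float]) -> float:
    """Sum of squared second differences of a curve sampled on an even grid.

    This is zero for constant or linear curves and grows with curvature,
    which makes it a simple discrete proxy for a smoothness penalty.
    """
    return sum(
        (values[i - 1] - 2 * values[i] + values[i + 1]) ** 2
        for i in range(1, len(values) - 1)
    )


def penalized_score(
    observed: Sequence[float],
    expected: Sequence[float],
    size_on_grid: Sequence[float],
    weight: float,
) -> float:
    """Weighted average of the fit term and the smoothness penalty."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fit = goodness_of_fit(observed, expected)
    penalty = roughness(size_on_grid)
    return weight * fit + (1.0 - weight) * penalty


score = penalized_score([12, 5], [10, 5], [1.0, 2.0, 4.0], weight=0.5)
```

A weight near 1 favours matching the observed SFS closely, while a weight near 0 favours a smooth population-size curve; the abstract states that this trade-off is chosen by cross-validation.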


2018 ◽  
Author(s):  
Christelle Fraïsse ◽  
Camille Roux ◽  
Pierre-Alexandre Gagnaire ◽  
Jonathan Romiguier ◽  
Nicolas Faivre ◽  
...  

Abstract Genome-scale diversity data are increasingly available in a variety of biological systems, and can be used to reconstruct the past evolutionary history of species divergence. However, extracting the full demographic information from these data is not trivial, and requires inferential methods that account for the diversity of coalescent histories throughout the genome. Here, we evaluate the potential and limitations of one such approach. We reexamine a well-known system of mussel sister species, using the joint site frequency spectrum (jSFS) of synonymous mutations computed either from exome capture or RNA-seq, in an Approximate Bayesian Computation (ABC) framework. We first assess the best sampling strategy (number of individuals, loci, and bins in the jSFS), and show that model selection is robust to variation in the number of individuals and loci. In contrast, different binning choices when summarizing the joint site frequency spectrum strongly affect the results: including classes of low and high frequency shared polymorphisms can more effectively reveal recent migration events. We then take advantage of the flexibility of ABC to compare more realistic models of speciation, including variation in migration rates through time (i.e. periodic connectivity) and across genes (i.e. genome-wide heterogeneity in migration rates). We show that these models were consistently selected as the most probable, suggesting that mussels have experienced a complex history of gene flow during divergence and that the species boundary is semi-permeable. Our work provides a comprehensive evaluation of ABC demographic inference in mussels based on the coding site frequency spectrum, and supplies guidelines for employing different sequencing techniques and sampling strategies. We emphasize, perhaps surprisingly, that inferences are less limited by the volume of data than by the way in which they are analyzed.


2019 ◽  
Vol 36 (12) ◽  
pp. 2906-2921 ◽  
Author(s):  
Austin H Patton ◽  
Mark J Margres ◽  
Amanda R Stahlke ◽  
Sarah Hendricks ◽  
Kevin Lewallen ◽  
...  

Abstract Reconstructing species’ demographic histories is a central focus of molecular ecology and evolution. Recently, an expanding suite of methods leveraging either the sequentially Markovian coalescent (SMC) or the site-frequency spectrum has been developed to reconstruct population size histories from genomic sequence data. However, few studies have investigated the robustness of these methods to genome assemblies of varying quality. In this study, we first present an improved genome assembly for the Tasmanian devil using the Chicago library method. Compared with the original reference genome, our new assembly reduces the number of scaffolds (from 35,975 to 10,010) and increases the scaffold N90 (from 0.101 to 2.164 Mb). Second, we assess the performance of four contemporary genomic methods for inferring population size history (PSMC, MSMC, SMC++, Stairway Plot), using the two devil genome assemblies as well as simulated, artificially fragmented genomes that approximate the hypothesized demographic history of Tasmanian devils. We demonstrate that each method is robust to assembly quality, producing similar estimates of Ne when simulated genomes were fragmented into up to 5,000 scaffolds. Overall, methods reliant on the SMC are most reliable between ∼300 generations before present (gbp) and 100 kgbp, whereas methods exclusively reliant on the site-frequency spectrum are most reliable between the present and 30 gbp. Our results suggest that when used in concert, genomic methods for reconstructing species’ effective population size histories 1) can be applied to nonmodel organisms without highly contiguous reference genomes, and 2) are capable of detecting independently documented effects of historical geological events.
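The scaffold N90 quoted in this abstract is a standard assembly-contiguity statistic. The sketch below shows one common way to compute it, under the usual definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least 90% of the assembly. The function name `scaffold_nx` and the toy lengths are illustrative and are not taken from the devil assemblies.

```python
from typing import Sequence


def scaffold_nx(lengths: Sequence[int], fraction: float = 0.9) -> int:
    """Return the Nx statistic of an assembly (N90 when fraction is 0.9).

    Scaffolds are visited from longest to shortest; the answer is the length
    of the scaffold at which the running total first reaches the requested
    fraction of the total assembly length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")

    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return min(lengths)  # only reachable through floating-point rounding


# Toy assembly of five scaffolds (lengths in arbitrary units).
toy_lengths = [50, 30, 12, 5, 3]
n90 = scaffold_nx(toy_lengths, fraction=0.9)
n50 = scaffold_nx(toy_lengths, fraction=0.5)
```

A larger N90 with fewer scaffolds, as reported for the new assembly, means that most of the genome sits in longer contiguous pieces.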


2020 ◽  
Vol 189 (11) ◽  
pp. 1421-1426
Author(s):  
Yicheng Ma ◽  
Helen E Jenkins ◽  
Paola Sebastiani ◽  
Jerrold J Ellner ◽  
Edward C Jones-López ◽  
...  

Abstract Serial interval (SI), defined as the time between symptom onset in an infector and infectee pair, is commonly used to understand infectious disease transmission. Slow progression to active disease, as well as the small percentage of individuals who will eventually develop active disease, complicate the estimation of the SI for tuberculosis (TB). In this paper, we showed via simulation studies that when there is credible information on the percentage of those who will develop TB disease following infection, a cure model, first introduced by Boag in 1949, should be used to estimate the SI for TB. This model includes a parameter in the likelihood function to account for the study population being composed of those who will have the event of interest and those who will never have the event. We estimated the SI for TB to be approximately 0.5 years for the United States and Canada (January 2002 to December 2006) and approximately 2.0 years for Brazil (March 2008 to June 2012), which might imply a higher occurrence of reinfection TB in a developing country like Brazil.
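The extra likelihood parameter described in this abstract can be illustrated with a minimal mixture cure model sketch. The exponential event-time distribution is an assumption made here for brevity, since the abstract does not state which distribution the authors used, and the function and argument names are hypothetical.

```python
import math
from typing import Sequence


def cure_model_log_likelihood(
    times: Sequence[float],
    events: Sequence[bool],
    susceptible_fraction: float,
    rate: float,
) -> float:
    """Log-likelihood of a mixture cure model with exponential event times.

    A fraction p of the population will eventually have the event and the
    remaining 1 - p never will. An observed event at time t contributes
    p * f(t); a censored observation at time t contributes (1 - p) + p * S(t),
    because that individual is either never going to have the event or has
    simply not had it yet.
    """
    p = susceptible_fraction
    if not 0.0 < p <= 1.0:
        raise ValueError("susceptible_fraction must lie in (0, 1]")
    if rate <= 0.0:
        raise ValueError("rate must be positive")

    log_likelihood = 0.0
    for t, observed in zip(times, events):
        survival = math.exp(-rate * t)  # S(t) for the exponential (assumed)
        density = rate * survival       # f(t) for the exponential (assumed)
        if observed:
            log_likelihood += math.log(p * density)
        else:
            log_likelihood += math.log((1.0 - p) + p * survival)
    return log_likelihood


# One observed event at 0.5 years and one observation censored at 2 years.
ll_no_cure = cure_model_log_likelihood([0.5, 2.0], [True, False], 1.0, 1.0)
ll_cure = cure_model_log_likelihood([2.0], [False], 0.1, 1.0)
```

With `susceptible_fraction` set to 1 the expression reduces to the ordinary censored-data likelihood, which is a quick way to check an implementation.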


Genetics ◽  
2016 ◽  
Vol 202 (4) ◽  
pp. 1549-1561 ◽  
Author(s):  
Jeffrey P. Spence ◽  
John A. Kamm ◽  
Yun S. Song

2020 ◽  
Vol 29 (10) ◽  
pp. 2919-2931
Author(s):  
Xinyi Ge ◽  
Yingwei Peng ◽  
Dongsheng Tu

Identification of a subset of patients who may be sensitive to a specific treatment is an important problem in clinical trials. In this paper, we consider the case where the treatment effect is measured by longitudinal outcomes, such as quality of life scores assessed over the duration of a clinical trial, and the subset is determined by a continuous baseline covariate, such as age or the expression level of a biomarker. A threshold linear mixed model is introduced, and a smoothing maximum likelihood method is proposed to estimate the parameters in the model. The Broyden-Fletcher-Goldfarb-Shanno algorithm is employed to maximize the proposed smoothing likelihood function. The proposed procedure is evaluated through simulation studies and application to the analysis of data from a randomized clinical trial on patients with advanced colorectal cancer.

