Computational Inference Beyond Kingman's Coalescent

2015 ◽  
Vol 52 (2) ◽  
pp. 519-537 ◽  
Author(s):  
Jere Koskela ◽  
Paul Jenkins ◽  
Dario Spanò

Full likelihood inference under Kingman's coalescent is a computationally challenging problem to which importance sampling (IS) and the product of approximate conditionals (PAC) methods have been applied successfully. Both methods can be expressed in terms of families of intractable conditional sampling distributions (CSDs), and rely on principled approximations for accurate inference. Recently, more general Λ- and Ξ-coalescents have been observed to provide better modelling fits to some genetic data sets. We derive families of approximate CSDs for finite sites Λ- and Ξ-coalescents, and use them to obtain ‘approximately optimal’ IS and PAC algorithms for Λ-coalescents, yielding substantial gains in efficiency over existing methods.
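
For readers new to the approach, here is a minimal sketch of the importance-sampling estimator that both IS and PAC-style methods build on; `propose_history`, `target_logpdf`, and `proposal_logpdf` are hypothetical callables standing in for the (approximate) CSD-based machinery, not the authors' implementation.

```python
import numpy as np

def is_log_likelihood(data, theta, propose_history, target_logpdf,
                      proposal_logpdf, n_samples=1000):
    """Estimate log L(theta) = log E_q[ p(data, G; theta) / q(G; theta) ],
    where G is a latent genealogy drawn from the proposal q."""
    log_w = np.empty(n_samples)
    for i in range(n_samples):
        g = propose_history(data, theta)   # genealogy consistent with the data
        log_w[i] = target_logpdf(data, g, theta) - proposal_logpdf(g, data, theta)
    m = log_w.max()                        # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_w - m)))
```

The closer the proposal is to the true conditional distribution of genealogies given the data, the lower the variance of the weights, which is why the quality of the approximate CSDs drives the efficiency gains reported here.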



Genetics ◽  
2003 ◽  
Vol 164 (4) ◽  
pp. 1561-1566 ◽  
Author(s):  
Sharon Browning

We propose a new method for calculating probabilities for pedigree genetic data that incorporates crossover interference using chi-square models. Applications include relationship inference, genetic map construction, and linkage analysis. The method is based on importance sampling of unobserved inheritance patterns conditional on the observed genotype data, and takes advantage of fast algorithms for no-interference models while using reweighting to allow for interference. We show that the method is effective for arbitrarily many markers with small pedigrees.
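
As an illustration of the reweighting idea (a sketch under assumed interfaces, not Browning's code): inheritance patterns are drawn conditional on the genotype data under the fast no-interference model, and each draw is reweighted by the ratio of crossover-process densities.

```python
import numpy as np

def pedigree_log_lik_chisq(data, n_samples, sample_pattern,
                           logp_noint, logp_chisq, log_lik_noint):
    """log P_chisq(data) = log P_noint(data) + log E[ p_chisq(V) / p_noint(V) ],
    with inheritance patterns V ~ P_noint(V | data) from the fast sampler."""
    log_w = np.array([logp_chisq(v) - logp_noint(v)
                      for v in (sample_pattern(data) for _ in range(n_samples))])
    m = log_w.max()                        # log-sum-exp for numerical stability
    return log_lik_noint(data) + m + np.log(np.mean(np.exp(log_w - m)))
```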



2012 ◽  
Vol 44 (2) ◽  
pp. 391-407 ◽  
Author(s):  
Anand Bhaskar ◽  
Yun S. Song

Obtaining a closed-form sampling distribution for the coalescent with recombination is a challenging problem. In the case of two loci, a new framework based on an asymptotic series has recently been developed to derive closed-form results when the recombination rate is moderate to large. In this paper, an arbitrary number of loci is considered and combinatorial approaches are employed to find closed-form expressions for the first couple of terms in an asymptotic expansion of the multi-locus sampling distribution. These expressions are universal in the sense that their functional form in terms of the marginal one-locus distributions applies to all finite- and infinite-alleles models of mutation.
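
Schematically, and with notation assumed here rather than taken from the paper, the expansion has the following form for large recombination rate ρ:

```latex
% Leading behaviour of the multi-locus sampling distribution as \rho \to \infty.
% q_k = marginal one-locus sampling distribution at locus k (notation assumed).
\[
  q(\mathbf{n}) \;=\; \prod_{k} q_k(\mathbf{n}_k)
  \;+\; \frac{q^{(1)}(\mathbf{n})}{\rho}
  \;+\; O\!\left(\rho^{-2}\right), \qquad \rho \to \infty.
\]
```

The leading term factorises over loci, and the universality result says the correction terms are fixed functions of these marginal one-locus distributions, whatever the (finite- or infinite-alleles) mutation model.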



2019 ◽  
Vol 9 (14) ◽  
pp. 2841 ◽  
Author(s):  
Nan Zhang ◽  
Xueyi Gao ◽  
Tianyou Yu

Attribute reduction is a challenging problem in rough set theory, which has been applied in many research fields, including knowledge representation, machine learning, and artificial intelligence. The main objective of attribute reduction is to obtain a minimal attribute subset that can retain the same classification or discernibility properties as the original information system. Recently, many attribute reduction algorithms, such as positive region preservation, generalized decision preservation, and distribution preservation, have been proposed. The existing attribute reduction algorithms for generalized decision preservation are mainly based on the discernibility matrix and are, thus, computationally very expensive and hard to use in large-scale and high-dimensional data sets. To overcome this problem, we introduce the similarity degree for generalized decision preservation. On this basis, the inner and outer significance measures are proposed. By using heuristic strategies, we develop two quick reduction algorithms for generalized decision preservation. Finally, theoretical and experimental results show that the proposed heuristic reduction algorithms are effective and efficient.
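
In outline, the heuristic strategy described here is a greedy forward search driven by a significance measure; the sketch below assumes generic `measure` and `significance` callables in place of the similarity-degree-based definitions from the paper.

```python
def quick_reduct(attributes, measure, significance, eps=1e-12):
    """Greedy forward selection: repeatedly add the attribute with the largest
    (outer) significance until the growing subset preserves the same
    generalized-decision measure as the full attribute set."""
    target = measure(set(attributes))
    reduct, remaining = set(), set(attributes)
    while remaining and measure(reduct) < target - eps:
        best = max(remaining, key=lambda a: significance(a, reduct))
        reduct.add(best)
        remaining.remove(best)
    return reduct
```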



2004 ◽  
Vol 61 (7) ◽  
pp. 1075-1082 ◽  
Author(s):  
Steven T Kalinowski

Genetic data can be used to estimate the stock composition of mixed-stock fisheries. Designing efficient strategies for estimating mixture proportions is important, but several aspects of study design remain poorly understood, particularly the relationship between genetic polymorphism and estimation error. In this study, computer simulation was used to investigate how the following variables affect expected squared error of mixture estimates: the number of loci examined, the number of alleles at those loci, and the size of baseline data sets. This work showed that (i) loci with more alleles produced estimates of stock proportions that had a lower expected squared error than less polymorphic loci, (ii) highly polymorphic loci did not require larger samples than less polymorphic loci, and (iii) the total number of independent alleles examined is a reasonable indicator of the quality of estimates of stock proportions.
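
For context, the quantity whose squared error such simulations measure is typically a maximum-likelihood mixture estimate; a minimal EM sketch follows, assuming baseline allele-frequency estimates have already been turned into per-stock genotype log-likelihoods.

```python
import numpy as np

def em_mixture_proportions(log_lik, n_iter=200):
    """EM for stock proportions pi, given log_lik[i, s] = log P(genotype of
    fish i | stock s) computed from baseline allele-frequency estimates."""
    _, S = log_lik.shape
    pi = np.full(S, 1.0 / S)
    for _ in range(n_iter):
        log_post = log_lik + np.log(pi)                   # E-step
        log_post -= log_post.max(axis=1, keepdims=True)   # stabilise
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        pi = post.mean(axis=0)                            # M-step
    return pi
```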



2014 ◽  
Author(s):  
Susanta Tewari ◽  
John L Spouge

Importance sampling is widely used in coalescent theory to compute data likelihoods. Efficient importance sampling requires a trial distribution close to the target distribution of the genealogies conditioned on the data. Moreover, an efficient proposal requires intuition about how the data influence the target distribution. Different proposals might work under similar conditions, and sometimes the corresponding concepts overlap extensively. Currently, there is no framework available for coalescent theory that evaluates proposals in an integrated manner. Typically, problems are not modeled explicitly, optimization is performed aggressively on limited datasets, user interaction requires thorough knowledge, and programs are not aligned with the current demands of open science. We have designed a general framework (http://coalescent.sourceforge.net) for importance sampling to compute the data likelihood under the infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. The framework computes the data likelihood and provides maximum likelihood estimates of the mutation parameter. Well-known benchmarks in the coalescent literature validate the framework’s accuracy. We evaluate several proposals from the coalescent literature and find that the order of efficiency among three standard proposals changes when running time is considered along with the effective sample size. The framework provides an intuitive user interface with minimal clutter. For speed, it switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac, and Linux). Extensive tests and coverage make the framework accessible to a large community.
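
The efficiency comparison rests on the (Kish) effective sample size of the importance weights; a small helper, shown here as a sketch, makes the time-adjusted criterion concrete.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size of importance weights: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def ess_per_second(weights, elapsed_seconds):
    """Time-adjusted efficiency; ranking proposals by this joint criterion is
    what can reorder them relative to ranking by ESS alone."""
    return effective_sample_size(weights) / elapsed_seconds
```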



2006 ◽  
Vol 33 ◽  
pp. 39-49 ◽  
Author(s):  
Mark Pluciennik

This paper examines the ways in which genetic data have been used over the past two decades to interpret the transition to agriculture in Europe, and the relationship of these interpretations to more strictly archaeological explanations. It is suggested that, until recently, those working within the two disciplines have been using not only different data sets and methodologies, but also working within different disciplinary traditions which have inhibited communication and collaboration, and the production of a genuinely integrated field of ‘archaeogenetics’.



2019 ◽  
Author(s):  
Emil Jørsboe ◽  
Anders Albrechtsen

Introduction: Association studies using genetic data from SNP-chip based imputation or low-depth sequencing data provide a cost-efficient design for large-scale studies. However, these approaches provide genetic data with uncertainty in the observed genotypes. Here we explore association methods that can be applied to data where the genotype is not directly observed. We investigate how using different priors when estimating genotype probabilities affects the association results in different scenarios, such as studies with population structure and varying-depth sequencing data. We also suggest a method (ANGSD-asso) that is computationally feasible for analysing large-scale low-depth sequencing data sets, such as those generated by non-invasive prenatal testing (NIPT) with low-pass sequencing.

Methods: ANGSD-asso’s EM model works by modelling the unobserved genotype as a latent variable in a generalised linear model framework. The software is implemented in C/C++ and can be run multi-threaded, enabling the analysis of big data sets. ANGSD-asso is based on genotype probabilities, which can be estimated in various ways, such as using the sample allele frequency as a prior, using the individual allele frequencies as a prior, or using haplotype frequencies from haplotype imputation. Using simulations of sequencing data, we explore how genotype probability based methods compare to using genetic dosages in large association studies with genotype uncertainty.

Results & Discussion: Our simulations show that in a structured population the individual allele frequency prior has better power than the sample allele frequency prior. If there is a correlation between genotype uncertainty and phenotype, then the individual allele frequency prior also helps control the false positive rate. In the absence of population structure the sample allele frequency prior and the individual allele frequency prior perform similarly. In scenarios where sequencing depth and phenotype are correlated, ANGSD-asso’s EM model has better statistical power and less bias than using dosages. Lastly, when adding additional covariates to the linear model, ANGSD-asso’s EM model has more statistical power and provides less biased effect sizes than other methods that accommodate genotype uncertainty, while also being much faster. This makes it possible to properly account for genotype uncertainty in large-scale association studies.
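
To make the genotype-probability machinery concrete, here is a sketch (not ANGSD-asso’s code) of how per-site genotype likelihoods combine with an allele-frequency prior into posterior genotype probabilities, and of the expected dosage that simpler methods plug into the regression.

```python
import numpy as np

def posterior_genotype_probs(geno_like, allele_freq):
    """Posterior P(G = g | reads) for g in {0, 1, 2}: per-genotype likelihoods
    times a Hardy-Weinberg prior built from an allele frequency (the sample or
    the individual frequency, depending on the chosen prior)."""
    f = allele_freq
    prior = np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])
    post = np.asarray(geno_like) * prior
    return post / post.sum()

def expected_dosage(post):
    """Expected genotype, the quantity a dosage-based regression would use;
    the EM model instead keeps the full latent distribution over G."""
    return post @ np.array([0.0, 1.0, 2.0])
```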



2019 ◽  
Author(s):  
Aaron J. Stern ◽  
Peter R. Wilton ◽  
Rasmus Nielsen

Most current methods for detecting natural selection from DNA sequence data are limited in that they are either based on summary statistics or a composite likelihood, and as a consequence do not make full use of the information available in DNA sequence data. We here present a new importance sampling approach for approximating the full likelihood function for the selection coefficient. The method treats the ancestral recombination graph (ARG) as a latent variable that is integrated out using previously published Markov chain Monte Carlo (MCMC) methods. The method can be used for detecting selection, estimating selection coefficients, testing models of changes in the strength of selection, estimating the time of the start of a selective sweep, and inferring the allele frequency trajectory of a selected or neutral allele. We perform extensive simulations to evaluate the method and show that it uniformly improves power to detect selection compared to current popular methods such as nSL and SDS under various demographic models, and can provide reliable inferences of allele frequency trajectories under many conditions. We also explore the potential of our method to detect extremely recent changes in the strength of selection. We use the method to infer the past allele frequency trajectory for a lactase persistence SNP (MCM6) in Europeans. We also study a set of 11 pigmentation-associated variants. Several genes show evidence of strong selection, particularly within the last 5,000 years, including ASIP, KITLG, and TYR. However, selection on OCA2/HERC2 seems to be much older and, in contrast to previous claims, we find no evidence of selection on TYRP1.

Author summary: Current methods to study natural selection using modern population genomic data are limited in their power and flexibility. Here, we present a new method to infer natural selection that builds on recent methodological advances in estimating genome-wide genealogies. By using importance sampling we are able to efficiently estimate the likelihood function of the selection coefficient. We show our method improves power to test for selection over competing methods across a diverse range of scenarios, and also accurately infers the selection coefficient. We also demonstrate a novel capability of our model, using it to infer the allele’s frequency over time. We validate these results with a study of a lactase persistence SNP in Europeans, and also study a set of 11 pigmentation-associated variants.
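
The core importance-sampling identity can be sketched as follows, with hypothetical density callables rather than the authors’ implementation: latent trajectories (or ARGs) are drawn under the neutral model via MCMC, and the likelihood ratio for a selection coefficient s is a reweighted average.

```python
import numpy as np

def selection_likelihood_ratios(trajectories, log_p_sel, log_p_neut, s_grid):
    """Estimate L(s)/L(0) for each s on a grid: average, over trajectories
    sampled under neutrality, of the density ratio p_s(traj) / p_0(traj)."""
    out = []
    for s in s_grid:
        log_w = np.array([log_p_sel(t, s) - log_p_neut(t) for t in trajectories])
        m = log_w.max()                    # log-sum-exp for numerical stability
        out.append(float(np.exp(m) * np.mean(np.exp(log_w - m))))
    return np.array(out)
```

Maximising this curve over the grid of s values gives the point estimate, and comparing its value at the maximum against s = 0 gives a likelihood-ratio test for selection.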


