Deep learning for population size history inference: Design, comparison and combination with approximate Bayesian computation

Author(s):  
Théophile Sanchez ◽  
Jean Cury ◽  
Guillaume Charpiat ◽  
Flora Jay

Abstract
For the past decades, simulation-based likelihood-free inference methods have enabled researchers to address numerous population genetics problems. As the richness and amount of simulated and real genetic data keep increasing, the field has a strong opportunity to tackle tasks that current methods hardly solve. However, high data dimensionality forces most methods to summarize large genomic datasets into a relatively small number of handcrafted features (summary statistics). Here we propose an alternative to summary statistics, based on the automatic extraction of relevant information using deep learning techniques. Specifically, we design artificial neural networks (ANNs) that take as input single nucleotide polymorphism (SNP) data from individuals sampled in a single population and infer the past effective population size history. First, we provide guidelines for constructing artificial neural networks that comply with the intrinsic properties of SNP data, such as invariance to permutation of haplotypes, long-range interactions between SNPs, and variable genomic length. Using a Bayesian hyperparameter optimization procedure, we evaluate the performance of multiple networks and compare them to well-established methods such as Approximate Bayesian Computation (ABC). Even without the expert knowledge encoded in summary statistics, our approach compares fairly well to an ABC based on handcrafted features. Furthermore, we show that combining deep learning and ABC can improve performance while taking advantage of both frameworks. Finally, we apply our approach to reconstruct the effective population size history of cattle breed populations.
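The central design constraint here is exchangeability: permuting the haplotype rows of a SNP matrix must not change the network's output. Below is a minimal PyTorch sketch of that idea; the class name, layer sizes, and single-value regression head are illustrative assumptions, not the authors' published architecture, which is considerably more elaborate.

```python
import torch
import torch.nn as nn

class ExchangeableSNPEncoder(nn.Module):
    """Sketch of a haplotype-permutation-invariant encoder.

    Input: SNP matrix of shape (batch, n_haplotypes, n_snps).
    A shared 1D convolution is applied to every haplotype
    independently, then features are pooled symmetrically (mean/max)
    across haplotypes, so reordering haplotypes cannot change the output.
    """

    def __init__(self, n_features: int = 32):
        super().__init__()
        # Shared weights across haplotypes: applied row by row.
        self.row_conv = nn.Sequential(
            nn.Conv1d(1, n_features, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # tolerates variable n_snps
        )
        # Hypothetical head: a single value, e.g. log N_e at one time step.
        self.head = nn.Linear(2 * n_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, h, s = x.shape
        rows = x.reshape(b * h, 1, s)                 # each haplotype separately
        feats = self.row_conv(rows).reshape(b, h, -1)
        # Symmetric pooling over the haplotype axis -> permutation invariance.
        pooled = torch.cat([feats.mean(dim=1), feats.max(dim=1).values], dim=-1)
        return self.head(pooled)

# Quick check: permuting haplotypes leaves the output unchanged.
net = ExchangeableSNPEncoder()
x = torch.randint(0, 2, (4, 20, 300)).float()
perm = x[:, torch.randperm(20), :]
assert torch.allclose(net(x), net(perm), atol=1e-5)
```

Symmetric pooling (mean and max over the haplotype axis) is what enforces the invariance, while the adaptive pooling over the SNP axis lets the same network accept inputs of variable genomic length, echoing the properties listed in the abstract.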


2020 ◽  
Author(s):  
Manolo F. Perez ◽  
Isabel A. S. Bonatelli ◽  
Monique Romeiro-Brito ◽  
Fernando F. Franco ◽  
Nigel P. Taylor ◽  
...  

Abstract
Delimiting species boundaries is a major goal in evolutionary biology. An increasing body of literature has focused on the challenges of investigating cryptic diversity within complex evolutionary scenarios of speciation, including gene flow and demographic fluctuations. New methods based on model selection, such as approximate Bayesian computation, approximate likelihood, and machine learning approaches, are promising tools arising in this field. Here, we introduce a framework for species delimitation using the multispecies coalescent model coupled with a deep learning algorithm based on convolutional neural networks (CNNs). We compared this strategy with a similar ABC approach. We applied both methods to test species boundary hypotheses based on current and previous taxonomic delimitations as well as genetic data (sequences from 41 loci) in Pilosocereus aurisetus, a cactus species with a sky-island distribution and taxonomic uncertainty. To validate our proposed method, we also applied the same strategy to sequence data from widely accepted species of the genus Drosophila. The results show that our CNN approach has a high capacity to distinguish among the simulated species delimitation scenarios, with higher accuracy than the ABC procedure. For Pilosocereus, the delimitation hypothesis based on a splitter taxonomic arrangement without migration showed the highest probability in both the CNN and ABC approaches. The splits observed within P. aurisetus agree with previous taxonomic conjectures recognizing more taxonomic entities within currently accepted species. Our results highlight the cryptic diversity within P. aurisetus and show that CNNs are a promising approach for distinguishing divergent and complex evolutionary histories, even outperforming the accuracy of other model-based approaches such as ABC.
Keywords: species delimitation, fragmented systems, recent diversification, deep learning, convolutional neural networks, approximate Bayesian computation
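Because the CNN is used for model selection, the setup reduces to a classification problem: simulate datasets under each candidate delimitation scenario, train a network to recover the generating scenario, and read its softmax output on the observed data as approximate scenario probabilities. The PyTorch sketch below uses random 0/1 tensors as stand-ins for coalescent simulations; the architecture and all names are illustrative assumptions rather than the authors' exact pipeline.

```python
import torch
import torch.nn as nn

class DelimitationCNN(nn.Module):
    """Sketch of CNN-based model selection among K delimitation scenarios.

    Each simulated dataset is encoded as a (sequences x sites) 0/1 image;
    the network is trained to predict which scenario generated it, and its
    softmax output approximates scenario probabilities for observed data.
    """

    def __init__(self, n_scenarios: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # accepts variable alignment sizes
        )
        self.classifier = nn.Linear(32, n_scenarios)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# One training step, sketched: scenario labels come from the simulator.
model = DelimitationCNN(n_scenarios=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
sims = torch.randint(0, 2, (64, 1, 40, 200)).float()  # stand-in for coalescent sims
labels = torch.randint(0, 3, (64,))
opt.zero_grad()
loss = loss_fn(model(sims), labels)
loss.backward()
opt.step()
# Approximate scenario probabilities for one "observed" alignment:
probs = torch.softmax(model(sims[:1]), dim=1)
```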


PLoS Genetics ◽  
2016 ◽  
Vol 12 (3) ◽  
pp. e1005877 ◽  
Author(s):  
Simon Boitard ◽  
Willy Rodríguez ◽  
Flora Jay ◽  
Stefano Mona ◽  
Frédéric Austerlitz

Inferring the ancestral dynamics of effective population size is a long-standing question in population genetics, which can now be tackled much more accurately thanks to the massive genomic data available in many species. Several promising methods that take advantage of whole-genome sequences have been developed recently in this context. However, they can only be applied to rather small samples, which limits their ability to estimate recent population size history. Moreover, they can be very sensitive to sequencing or phasing errors. Here we introduce a new approximate Bayesian computation approach named PopSizeABC that allows estimating the evolution of the effective population size through time, using a large sample of complete genomes. This sample is summarized using the folded allele frequency spectrum and the average zygotic linkage disequilibrium in different bins of physical distance, two classes of statistics that are widely used in population genetics and can be easily computed from unphased and unpolarized SNP data. Our approach provides accurate estimates of past population sizes, from the very first generations before present back to the expected time to the most recent common ancestor of the sample, as shown by simulations under a wide range of demographic scenarios. When applied to samples of 15 or 25 complete genomes in four cattle breeds (Angus, Fleckvieh, Holstein and Jersey), PopSizeABC revealed a series of population declines related to historical events such as domestication or modern breed creation. We further highlight that our approach is robust to sequencing errors, provided summary statistics are computed from SNPs with common alleles.
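Both classes of summary statistics named above are straightforward to compute from unphased genotype matrices, which is what makes the approach insensitive to phasing. Below is a minimal numpy sketch of a folded allele frequency spectrum, distance-binned zygotic r², and a plain rejection-ABC step; function names and details are illustrative assumptions, and PopSizeABC's actual implementation differs in its details.

```python
import numpy as np

def folded_afs(G):
    """Folded allele frequency spectrum from an (individuals x SNPs)
    genotype dosage matrix (0/1/2); no ancestral polarization needed."""
    n = 2 * G.shape[0]                       # number of chromosomes
    counts = G.sum(axis=0).astype(int)
    minor = np.minimum(counts, n - counts)   # fold the spectrum
    return np.bincount(minor, minlength=n // 2 + 1)[1:]  # drop monomorphic class

def zygotic_ld(G, pos, bins):
    """Mean r^2 between genotype dosages per physical-distance bin;
    computable directly from unphased data (empty bins yield nan)."""
    r2 = np.corrcoef(G.T) ** 2               # SNP-by-SNP squared correlation
    i, j = np.triu_indices(G.shape[1], k=1)
    dist = np.abs(pos[j] - pos[i])
    which = np.digitize(dist, bins)
    return np.array([r2[i, j][which == b].mean() for b in range(1, len(bins))])

def abc_reject(obs_stats, sim_stats, params, keep=0.01):
    """Plain rejection ABC: keep the parameter draws whose simulated
    statistics fall closest to the observed ones."""
    d = np.linalg.norm(sim_stats - obs_stats, axis=1)
    k = max(1, int(keep * len(d)))
    return params[np.argsort(d)[:k]]         # accepted posterior sample
```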


Author(s):  
Cecilia Viscardi ◽  
Michele Boreale ◽  
Fabio Corradi

Abstract
We consider the problem of sample degeneracy in Approximate Bayesian Computation. It arises when proposed values of the parameters, once given as input to the generative model, rarely lead to simulations resembling the observed data and are hence discarded. Such "poor" parameter proposals do not contribute at all to the representation of the parameter's posterior distribution. This leads to a very large number of required simulations and/or a waste of computational resources, as well as to distortions in the computed posterior distribution. To mitigate this problem, we propose an algorithm, referred to as the Large Deviations Weighted Approximate Bayesian Computation algorithm, in which, via Sanov's Theorem, strictly positive weights are computed for all proposed parameters, thus avoiding the rejection step altogether. In order to derive a computable asymptotic approximation from Sanov's result, we adopt the information-theoretic "method of types" formulation of the theory of Large Deviations, thus restricting our attention to models for i.i.d. discrete random variables. Finally, we experimentally evaluate our method through a proof-of-concept implementation.
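The weighting idea can be made concrete for i.i.d. discrete data: by Sanov's theorem, the probability that n draws from the model distribution P_theta produce the observed empirical type q_obs decays like exp(-n * D(q_obs || P_theta)), so every proposal receives a strictly positive weight instead of a hard accept/reject decision. The numpy sketch below is a schematic of that weighting, with P_theta estimated from a large simulated sample; it is not the paper's exact algorithm, and the `simulate` interface is an assumption.

```python
import numpy as np

def kl(q, p, eps=1e-12):
    """KL divergence D(q || p) for discrete pmfs on the same support."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / (p[mask] + eps))))

def ldw_abc_weights(obs, simulate, thetas, n_support, m=10_000):
    """Sanov-style importance weights for i.i.d. discrete data (sketch).

    Weight of each proposal: exp(-n * D(q_obs || P_theta_hat)), strictly
    positive (up to floating-point underflow), so no proposal is rejected.
    `simulate(theta, m)` is an assumed generative-model interface.
    """
    n = len(obs)
    q_obs = np.bincount(obs, minlength=n_support) / n        # observed type
    w = np.empty(len(thetas))
    for k, theta in enumerate(thetas):
        sim = simulate(theta, m)                             # large sample -> P_theta estimate
        p_theta = np.bincount(sim, minlength=n_support) / m
        w[k] = np.exp(-n * kl(q_obs, p_theta))
    return w / w.sum()                                       # normalized weights

# Toy usage: Bernoulli model, theta = success probability.
rng = np.random.default_rng(0)
obs = rng.binomial(1, 0.3, size=200)
thetas = rng.uniform(0, 1, size=500)
weights = ldw_abc_weights(obs, lambda t, m: rng.binomial(1, t, size=m),
                          thetas, n_support=2)
posterior_mean = float(np.sum(weights * thetas))
```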

