coalescent simulations Latest Research Papers

Theoretical Analysis of Principal Components in an Umbrella Model of Intraspecific Evolution

10.1101/2021.11.28.470252 ◽

2021 ◽

Author(s):

Maxime Estavoyer ◽

Olivier Francois

Keyword(s):

Principal Components ◽

Rare Variants ◽

Isolation By Distance ◽

Geographic Distance ◽

Principal Component ◽

Ancestral Population ◽

Alternative Theory ◽

Multilocus Genotype ◽

Coalescent Simulations ◽

Largest Eigenvalues

Principal component analysis (PCA) is one of the most frequently used approaches to describe population structure from multilocus genotype data. Regarding geographic range expansions of modern humans, interpretations of PCA have, however, been questioned, as there is uncertainty about the wave-like patterns that have been observed in principal components. It has indeed been argued that wave-like patterns are mathematical artifacts that arise generally when PCA is applied to data in which genetic differentiation increases with geographic distance. Here, we present an alternative theory for the observation of wave-like patterns in PCA. We study a coalescent model -- the umbrella model -- for the diffusion of genetic variants. The model is based on a hierarchy of splits from an ancestral population without any particular geographical structure. In the umbrella model, splits occur almost continuously in time, giving birth to small daughter populations at a regular pace. Our results provide detailed mathematical descriptions of eigenvalues and eigenvectors for the PCA of sampled genomic sequences under the model. Removing variants uniquely represented in the sample, the PCA eigenvectors are defined as cosine functions of increasing periodicity, reproducing wave-like patterns observed in equilibrium isolation-by-distance models. Including rare variants in the analysis, the eigenvectors corresponding to the largest eigenvalues exhibit complex wave shapes. The accuracy of our predictions is further investigated with coalescent simulations. Our analysis supports the hypothesis that highly structured wave-like patterns could arise from genetic drift only, and may not always be artificial outcomes of spatially structured data. Genomic data related to the peopling of the Americas are reanalyzed in the light of our new theory.

Evaluation of methods for the inference of ancestral recombination graphs

10.1101/2021.11.15.468686 ◽

2021 ◽

Author(s):

Debora Y C Brandt ◽

Xinzhu Wei ◽

Yun Deng ◽

Andrew H. Vaughn ◽

Rasmus Nielsen

Keyword(s):

Best Practices ◽

Dna Sequences ◽

Posterior Distribution ◽

Genetic Parameters ◽

Ancestral Recombination Graph ◽

Effective Population ◽

Coalescent Simulations ◽

Ancestral Recombination Graphs ◽

Evaluation Of Methods ◽

Coalescence Times

The ancestral recombination graph (ARG) is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress towards scalably estimating whole-genome genealogies. In addition to inferring the ARG, some of these methods can also provide ARGs sampled from a defined posterior distribution. Obtaining good samples of ARGs is crucial for quantifying statistical uncertainty and for estimating population genetic parameters such as effective population size, mutation rate, and allele age. Here, we use simulations to benchmark three popular ARG inference programs: ARGweaver, Relate, and tsdate. We use neutral coalescent simulations to 1) compare the true coalescence times to the inferred times at each locus; 2) compare the distribution of coalescence times across all loci to the expected exponential distribution; 3) evaluate whether the sampled coalescence times have the properties expected of a valid posterior distribution. We find that inferred coalescence times at each locus are more accurate in ARGweaver and Relate than in tsdate. However, all three methods tend to overestimate small coalescence times and underestimate large ones. Lastly, the posterior distribution of ARGweaver is closer to the expected posterior distribution than Relate's, but this higher accuracy comes at a substantial trade-off in scalability. The best choice of method will depend on the number and length of input sequences and on the goal of downstream analyses, and we provide guidelines for the best practices.
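The exponential expectation mentioned in this abstract can be illustrated with a minimal sketch of the standard neutral coalescent. This is not the benchmarking code used by the authors; the function names are hypothetical, and time is assumed to be measured in coalescent units, so that two lineages coalesce at rate 1.

```python
import random


def simulate_tmrca(n_samples, rng):
    """Draw one time to the most recent common ancestor (TMRCA).

    Assumes the standard neutral coalescent: while k lineages remain,
    the waiting time to the next coalescence is exponential with
    rate k * (k - 1) / 2 (time in coalescent units).
    """
    total = 0.0
    for k in range(n_samples, 1, -1):
        rate = k * (k - 1) / 2
        total += rng.expovariate(rate)
    return total


def mean_tmrca(n_samples, replicates, seed=1):
    """Average TMRCA over independent replicate genealogies."""
    rng = random.Random(seed)
    draws = (simulate_tmrca(n_samples, rng) for _ in range(replicates))
    return sum(draws) / replicates


def expected_tmrca(n_samples):
    """Theoretical mean: the sum of 2 / (k * (k - 1)) for k = 2..n,
    which telescopes to 2 * (1 - 1 / n)."""
    return 2 * (1 - 1 / n_samples)


if __name__ == "__main__":
    for n in (2, 10):
        print(n, round(mean_tmrca(n, 20000), 3), expected_tmrca(n))
```

With two samples the TMRCA is a single exponential draw with mean 1, which is the pairwise distribution an inferred set of coalescence times could be compared against.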

Addressing alpine plant phylogeography using integrative distributional, demographic and coalescent modeling

Alpine Botany ◽

10.1007/s00035-021-00263-w ◽

2021 ◽

Cited By ~ 1

Author(s):

Dennis J. Larsson ◽

Da Pan ◽

Gerald M. Schneeweiss

Keyword(s):

Ad Hoc ◽

Alpine Plant ◽

Demographic Modeling ◽

Coalescent Simulations ◽

As Species ◽

Recent Approach ◽

History Of ◽

Species Specific ◽

The Last Glacial Maximum ◽

The Last Glacial

Phylogeographic studies of alpine plants have evolved considerably in the last two decades from ad hoc interpretations of genetic data to statistical model-based approaches. In this review we outline the developments in alpine plant phylogeography focusing on the recent approach of integrative distributional, demographic and coalescent (iDDC) modeling. By integrating distributional data with spatially explicit demographic modeling and subsequent coalescent simulations, the history of alpine species can be inferred and long-standing hypotheses, such as species-specific responses to climate change or survival on nunataks during the last glacial maximum, can be efficiently tested as exemplified by available case studies. We also discuss future prospects and improvements of iDDC.

Mathematical constraints on FST: multiallelic markers in arbitrarily many populations

10.1101/2021.07.23.453474 ◽

2021 ◽

Author(s):

Nicolas Alcala ◽

Noah A Rosenberg

Keyword(s):

Genetic Differentiation ◽

Joint Distribution ◽

Island Model ◽

General Description ◽

Multiple Populations ◽

Mutation Model ◽

Coalescent Simulations ◽

And Migration ◽

Interspecific Comparisons ◽

Frequent Allele

Interpretations of values of the FST measure of genetic differentiation rely on an understanding of its mathematical constraints. Previously, it has been shown that FST values computed from a biallelic locus in a set of multiple populations and FST values computed from a multiallelic locus in a pair of populations are mathematically constrained by the frequency of the allele that is most frequent across populations. We report here the mathematical constraint on FST given the frequency M of the most frequent allele at a multiallelic locus in a set of multiple populations, providing the most general description to date of mathematical constraints on FST in terms of M. Using coalescent simulations of an island model of migration with an infinitely-many-alleles mutation model, we argue that the joint distribution of FST and M helps in disentangling the separate influences of mutation and migration on FST. Finally, we show that our results explain puzzling patterns of microsatellite differentiation, such as the lower FST values in interspecific comparisons between humans and chimpanzees than in the intraspecific comparison of chimpanzee populations. We discuss the implications of our results for the use of FST.
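As a rough illustration of the two quantities this abstract relates, the sketch below computes FST and M from allele frequencies. It assumes the common heterozygosity-based definition FST = (HT - HS) / HT with equally weighted populations, which the abstract does not spell out; the function names are hypothetical.

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for one frequency vector."""
    return 1.0 - sum(p * p for p in freqs)


def fst_and_m(pop_freqs):
    """Return (FST, M) for one multiallelic locus.

    pop_freqs: one list of allele frequencies per population, each
    summing to 1. Populations are weighted equally (an assumption).
    M is the mean frequency of the allele that is most frequent
    across populations.
    """
    k = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    mean_freqs = [sum(pop[i] for pop in pop_freqs) / k for i in range(n_alleles)]
    h_s = sum(heterozygosity(pop) for pop in pop_freqs) / k  # within populations
    h_t = heterozygosity(mean_freqs)                         # pooled population
    if h_t == 0.0:
        raise ValueError("FST is undefined when the locus is monomorphic overall")
    return (h_t - h_s) / h_t, max(mean_freqs)


if __name__ == "__main__":
    print(fst_and_m([[0.8, 0.2], [0.4, 0.6]]))
```

Two populations fixed for different alleles give FST = 1 with M = 0.5, while identical populations give FST = 0 whatever the value of M.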

A test of the hypothesis that variable mutation rates create signals that have previously been interpreted as evidence of archaic introgression into humans

10.1101/2020.12.23.424213 ◽

2020 ◽

Author(s):

William Amos

Keyword(s):

Mutation Rate ◽

Alternative Model ◽

Mutation Rates ◽

Human Populations ◽

Common Cause ◽

Coalescent Simulations ◽

Recurrent Mutations ◽

Archaic Introgression ◽

Excess Base ◽

Human Specific

It is widely accepted that non-African humans carry 1-2% Neanderthal DNA due to historical inter-breeding. However, inferences about introgression rely on the critical assumptions that mutation rate is constant and that back-mutations are too rare to be important. Both these assumptions have been challenged, and recent evidence points towards an alternative model where signals interpreted as introgression are driven mainly by higher mutation rates in Africa. In this model, non-Africans appear closer to archaics not because they harbour introgressed fragments but because Africans have diverged more. Here I test this idea by using the density of rare, human-specific variants (RHSVs) as a proxy for recent mutation rate. I find that sites that contribute most to the signal interpreted as introgression tend to occur in tightly defined regions spanning only a few hundred bases in which mutation rate differs greatly between the two human populations being compared. Mutation rate is invariably higher in the population into which introgression is not inferred. I confirmed that RHSV density reflects mutation rate by conducting a parallel analysis looking at the density of RHSVs around sites with three alleles, an independent class of site that also requires recurrent mutations to form. Near-identical peaks in RHSV density are found, suggesting a common cause. Similarly, coalescent simulations confirm that, with constant mutation rate, introgressed fragments do not occur preferentially in regions with a high density of rare, human-specific variants. Together, these observations are difficult to reconcile with a model where excess base-sharing is driven by archaic legacies but instead provide support for a higher mutation rate inside Africa driving increased divergence from the ancestral human state.

Pseudoreplication in genomics-scale datasets

10.1101/2020.11.12.380410 ◽

2020 ◽

Author(s):

Robin S. Waples ◽

Ryan K. Waples ◽

Eric J. Ward

Keyword(s):

Genome Size ◽

Variance Components ◽

Degrees Of Freedom ◽

Limiting Factor ◽

Entire Genome ◽

Variance Components Analysis ◽

Coalescent Simulations ◽

Components Analysis ◽

Correlated Information ◽

Rate Of Decline

In genomics-scale datasets, loci are closely packed within chromosomes and hence provide correlated information. Averaging across loci as if they were independent creates pseudoreplication, which reduces the effective degrees of freedom (n’) compared to the nominal degrees of freedom, n. This issue has been known for some time, but consequences have not been systematically quantified across the entire genome. Here we measured pseudoreplication (quantified by the ratio n’/n) for a common metric of genetic differentiation (FST) and a common measure of linkage disequilibrium between pairs of loci (r2). Based on data simulated using models (SLiM and msprime) that allow efficient forward-in-time and coalescent simulations while precisely controlling population pedigrees, we estimated n’ and n’/n by measuring the rate of decline in variance of mean FST and mean r2 as more loci were used. For both indices, n’ increases with Ne and genome size, as expected. However, even for large Ne and large genomes, n’ for r2 plateaus after a few thousand loci, and a variance components analysis indicates that the limiting factor is uncertainty associated with sampling individuals rather than genes. Pseudoreplication is less extreme for FST, but n’/n ≤ 0.01 can occur in datasets using tens of thousands of loci. Commonly-used block-jackknife methods consistently overestimated var(FST), producing very conservative confidence intervals. Predicting n’ based on our modeling results as a function of Ne, L, S, and genome size provides a robust way to quantify precision associated with genomics-scale datasets.
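To see why correlated loci shrink the effective degrees of freedom, consider a deliberately simplified sketch. It assumes every pair of loci shares the same correlation rho, which is an illustrative assumption and not the model fitted in the study. Under that assumption the variance of a mean of n values is sigma²(1 + (n - 1)rho)/n, so n’ = n / (1 + (n - 1)rho). The function name is hypothetical.

```python
def effective_n(n, rho):
    """Effective number of independent values among n equally correlated ones.

    Assumes a common pairwise correlation rho in [0, 1] (a simplification
    for illustration only): var(mean) = sigma^2 * (1 + (n - 1) * rho) / n,
    hence n' = n / (1 + (n - 1) * rho).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return n / (1.0 + (n - 1) * rho)


if __name__ == "__main__":
    n = 10_000
    for rho in (0.0, 0.001, 0.01):
        n_eff = effective_n(n, rho)
        print(rho, round(n_eff, 1), round(n_eff / n, 4))
```

Even a pairwise correlation of 0.01 pushes n’/n below 0.01 for 10,000 loci in this toy setting, and n’ can never exceed 1/rho however many loci are added, which is one way a plateau can arise.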

CoaTran: Coalescent tree simulation along a transmission network

10.1101/2020.11.10.377499 ◽

2020 ◽

Author(s):

Niema Moshiri

Keyword(s):

Open Source Software ◽

Supplementary Information ◽

Software Project ◽

Transmission Network ◽

Simulation Experiments ◽

Computational Tools ◽

Coalescent Simulations ◽

Coalescent Tree ◽

Population Scale ◽

Global Population

Motivation: The ability to simulate coalescent viral phylogenies constrained by a given transmission network can enable the benchmarking of computational tools used in molecular epidemiology as well as the ability to gain insights into unobservable aspects of the virology of a novel pathogen. However, such simulation experiments require generating a large number of technical simulation replicates, and existing tools for coalescent simulations along a transmission network are too slow to conduct such experiments at the scale of the global population. Results: CoaTran is a massively scalable tool that simulates a coalescent viral phylogeny constrained by a user-provided transmission network. CoaTran is written in highly-optimized C++ code and can generate global population scale phylogenetic coalescent simulations in seconds to minutes. Availability: CoaTran is freely available at https://github.com/niemasd/CoaTran as an open-source software project. Supplementary information: Supplementary data are available online.

Discovery of ongoing selective sweeps within Anopheles mosquito populations using deep learning

Molecular Biology and Evolution ◽

10.1093/molbev/msaa259 ◽

2020 ◽

Author(s):

Alexander T Xue ◽

Daniel R Schrider ◽

Andrew D Kern

Keyword(s):

Deep Learning ◽

Evolutionary Dynamics ◽

Mosquito Control ◽

Sub Saharan Africa ◽

Supervised Machine Learning ◽

Selective Sweeps ◽

Genome Data ◽

Coalescent Simulations ◽

Sub Saharan ◽

Increasing Demand

Identification of partial sweeps, which include both hard and soft sweeps that have not yet reached fixation, provides crucial information about ongoing evolutionary responses. To this end, we introduce partialS/HIC, a deep learning method to discover selective sweeps from population genomic data. partialS/HIC uses a convolutional neural network for image processing, which is trained with a large suite of summary statistics derived from coalescent simulations incorporating population-specific history, to distinguish between completed versus partial sweeps, hard versus soft sweeps, and regions directly affected by selection versus those merely linked to nearby selective sweeps. We perform several simulation experiments under various demographic scenarios to demonstrate partialS/HIC’s performance, which exhibits excellent resolution for detecting partial sweeps. We also apply our classifier to whole genomes from eight mosquito populations sampled across sub-Saharan Africa by the Anopheles gambiae 1000 Genomes Consortium, elucidating both continent-wide patterns as well as sweeps unique to specific geographic regions. These populations have experienced intense insecticide exposure over the past two decades, and we observe a strong overrepresentation of sweeps at insecticide resistance loci. Our analysis thus provides a list of candidate adaptive loci that may be relevant to mosquito control efforts. More broadly, our supervised machine learning approach introduces a method to distinguish between completed and partial sweeps, as well as between hard and soft sweeps, under a variety of demographic scenarios. As whole-genome data rapidly accumulate for a greater diversity of organisms, partialS/HIC addresses an increasing demand for useful selection scan tools that can track in-progress evolutionary dynamics.

Fast and Flexible Estimation of Effective Migration Surfaces

10.1101/2020.08.07.242214 ◽

2020 ◽

Cited By ~ 1

Author(s):

Joseph H. Marcus ◽

Wooseok Ha ◽

Rina Foygel Barber ◽

John Novembre

Keyword(s):

Population Genetic ◽

Isolation By Distance ◽

Genetic Data ◽

Population Genetic Data ◽

Gray Wolves ◽

Related Method ◽

Fast Estimation ◽

Spatial Population ◽

Markov Random ◽

Coalescent Simulations

An important feature in spatial population genetic data is often “isolation-by-distance,” where genetic differentiation tends to increase as individuals become more geographically distant. Recently, Petkova et al. (2016) developed a statistical method called Estimating Effective Migration Surfaces (EEMS) for visualizing spatially heterogeneous isolation-by-distance on a geographic map. While EEMS is a powerful tool for depicting spatial population structure, it can suffer from slow runtimes. Here we develop a related method called Fast Estimation of Effective Migration Surfaces (FEEMS). FEEMS uses a Gaussian Markov Random Field in a penalized likelihood framework that allows for efficient optimization and output of effective migration surfaces. Further, the efficient optimization facilitates the inference of migration parameters per edge in the graph, rather than per node (as in EEMS). When tested with coalescent simulations, FEEMS accurately recovers effective migration surfaces with complex gene-flow histories, including those with anisotropy. Applications of FEEMS to population genetic data from North American gray wolves show it to perform comparably to EEMS, but with solutions obtained orders of magnitude faster. Overall, FEEMS expands the ability of users to quickly visualize and interpret spatial structure in their data.

Signals interpreted as archaic introgression appear to be driven primarily by faster evolution in Africa

Royal Society Open Science ◽

10.1098/rsos.191900 ◽

2020 ◽

Vol 7 (7) ◽

pp. 191900 ◽

Cited By ~ 1

Author(s):

William Amos

Keyword(s):

Real Data ◽

Heterozygous State ◽

Opposite Pattern ◽

Coalescent Simulations ◽

Archaic Introgression ◽

African Individual ◽

D Values

Non-African humans appear to carry a few per cent archaic DNA due to ancient inter-breeding. This modest legacy and its likely recent timing imply that most introgressed fragments will be rare and hence will occur mainly in the heterozygous state. I tested this prediction by calculating D statistics, a measure of legacy size, for pairs of humans where one of the pair was conditioned always to be either homozygous or heterozygous. Using coalescent simulations, I confirmed that conditioning the non-African to be heterozygous increased D, while conditioning the non-African to be homozygous reduced D to zero. Repeating with real data reveals the exact opposite pattern. In African–non-African comparisons, D is near-zero if the African individual is held homozygous. Conditioning one of two Africans to be either homozygous or heterozygous invariably generates large values of D, even when both individuals are drawn from the same population. Invariably, the African with more heterozygous sites (conditioned heterozygous > unconditioned > conditioned homozygous) appears less related to the archaic. By contrast, the same analysis applied to pairs of non-Africans always yields near-zero D, showing that conditioning does not create large D without an underlying signal to expose. Large D values in humans are therefore driven almost entirely by heterozygous sites in Africans acting to increase divergence from related taxa such as Neanderthals. In comparison with heterozygous Africans, individuals that lack African heterozygous sites, whether non-African or conditioned homozygous African, always appear more similar to archaic outgroups, a signal previously interpreted as evidence for introgression. I hope these analyses will encourage others to consider increased divergence as well as increased similarity to archaics as mechanisms capable of driving asymmetrical base-sharing.
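For readers unfamiliar with the D statistic used in this abstract, the sketch below shows the usual ABBA-BABA form, D = (ABBA - BABA) / (ABBA + BABA). It is a simplified illustration and not the author's pipeline: it assumes four aligned haploid sequences, treats the outgroup allele as ancestral, and omits the homozygous/heterozygous conditioning described above. The function name and the toy sequences are hypothetical.

```python
def d_statistic(p1, p2, p3, outgroup):
    """ABBA-BABA D for four aligned haploid sequences.

    The outgroup allele at each site is taken as ancestral ("A").
    ABBA: P1 is ancestral, P2 and P3 share the derived allele.
    BABA: P2 is ancestral, P1 and P3 share the derived allele.
    A positive D means P2 shares more derived alleles with P3 than P1 does.
    """
    abba = 0
    baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if a == o and b == c and b != o:
            abba += 1
        elif b == o and a == c and a != o:
            baba += 1
    if abba + baba == 0:
        raise ValueError("no informative ABBA or BABA sites")
    return (abba - baba) / (abba + baba)


if __name__ == "__main__":
    # Toy alignment: three ABBA sites, one uninformative site, one BABA site.
    print(d_statistic("AAAAT", "TTTAA", "TTTAT", "AAAAA"))
```

Because D is a ratio of site-pattern counts, it is zero in expectation when P1 and P2 are equally related to P3, and any process that shifts the two counts asymmetrically moves it away from zero.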
