Predicting geographic location from genetic variation with deep neural networks

eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
CJ Battey ◽  
Peter L Ralph ◽  
Andrew D Kern

Most organisms are more closely related to nearby than distant members of their species, creating spatial autocorrelations in genetic data. This allows us to predict the location of origin of a genetic sample by comparing it to a set of samples of known geographic origin. Here, we describe a deep learning method, which we call Locator, to accomplish this task faster and more accurately than existing approaches. In simulations, Locator infers sample location to within 4.1 generations of dispersal and runs at least an order of magnitude faster than a recent model-based approach. We leverage Locator’s computational efficiency to predict locations separately in windows across the genome, which allows us to both quantify uncertainty and describe the mosaic ancestry and patterns of geographic mixing that characterize many populations. Applied to whole-genome sequence data from Plasmodium parasites, Anopheles mosquitoes, and global human populations, this approach yields median test errors of 16.9 km, 5.7 km, and 85 km, respectively.
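
As an illustration of the "median test error" reported above, the following minimal sketch computes the median great-circle distance between true and predicted sample locations. It is not Locator's code: the function names, the (latitude, longitude) tuple layout, and the use of the haversine formula with a mean Earth radius of 6371 km are all assumptions made for this example.

```python
import math
import statistics

# Assumption of this sketch: a spherical Earth with mean radius 6371 km.
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def median_test_error_km(true_locs, pred_locs):
    """Median distance between paired (lat, lon) true and predicted locations."""
    errors = [
        haversine_km(t[0], t[1], p[0], p[1])
        for t, p in zip(true_locs, pred_locs)
    ]
    return statistics.median(errors)


# Hypothetical held-out samples: true locations and model predictions.
true_locs = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
pred_locs = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
error_km = median_test_error_km(true_locs, pred_locs)
```

The median is used rather than the mean so that a few badly mislocated samples do not dominate the summary.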


2021 ◽  
Vol 99 (Supplement_3) ◽  
pp. 24-24
Author(s):  
Jicai Jiang ◽  
Li Ma ◽  
Jeffrey O’Connell

Abstract Partitioning SNP heritability by many functional annotations has been a successful tool for understanding the genetic architecture of complex traits in human genetic studies. Similar analyses are being extended to animal research, as (imputed) whole-genome sequence data of many individuals and various functional annotations have become available in livestock animals. Though many approaches have been developed for heritability partitioning (e.g., LDSC and HE-reg), they are mostly based on approximations tailored to human populations, and few can produce statistically efficient estimates for animal genomic studies where individuals are often related. To tackle this issue, we present a stochastic MINQUE (Minimum Norm Quadratic Unbiased Estimation) approach for partitioning SNP heritability, which we refer to as MPH. We provide a theoretical analysis comparing LDSC and HE-reg with REML and MPH and demonstrate what LDSC and HE-reg (and similar methods) take advantage of in their approximations: sparse relationships between individuals and relatively weak linkage disequilibrium. We also show that our method is mathematically equivalent to the MC-REML approach implemented in BOLT. MPH has three key features. First, it is comparable to genomic REML in terms of accuracy, while being at least one order of magnitude faster than GCTA and BOLT and using only about one quarter as much memory as GCTA, when applied to sequence data and many variance components (or functional annotation categories). Second, it can do weighted analyses if residual variances are unequal (such as DYD). Third, it works for many overlapping functional annotations. Using simulations based on a human pedigree and a dairy cattle pedigree, we illustrate the benefits of our method for partitioning SNP heritability in pedigree-based studies. We also demonstrate that it is feasible to efficiently partition SNP heritability for animal genomes with strong, long-span LD.
MPH is freely available at https://jiang18.github.io/mph.
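
For readers unfamiliar with HE-reg, one of the approximate methods the abstract compares against, the sketch below shows a single-component Haseman-Elston regression in its simplest textbook form: the products of standardized phenotypes for each pair of individuals are regressed, without an intercept, on the corresponding off-diagonal entries of a genomic relationship matrix. This is not MPH and not the authors' code. The function name, the list-of-lists matrix layout, and the toy data are assumptions made for illustration.

```python
import math


def he_regression_h2(grm, phenotypes):
    """Single-component Haseman-Elston estimate of SNP heritability.

    grm: symmetric n x n relationship matrix (list of lists).
    phenotypes: length-n list of trait values.
    Returns the no-intercept regression slope of z_i * z_j on grm[i][j]
    over all pairs i < j, where z is the standardized phenotype.
    """
    n = len(phenotypes)
    mean = sum(phenotypes) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in phenotypes) / n)
    z = [(y - mean) / sd for y in phenotypes]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            numerator += grm[i][j] * z[i] * z[j]
            denominator += grm[i][j] ** 2
    return numerator / denominator


# Toy example (hypothetical values): individuals 1 and 2 are related,
# individual 3 is unrelated to both.
toy_grm = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_phenotypes = [1.0, 1.0, -2.0]
toy_h2 = he_regression_h2(toy_grm, toy_phenotypes)
```

Because the estimate relies only on pairwise products, it is cheap to compute, but with only a few informative pairs, as in this toy matrix, the slope is driven by very little data.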


Author(s):  
Glenn-Peter Sætre ◽  
Mark Ravinet

Evolutionary genetics is the study of how genetic variation leads to evolutionary change. With the recent explosion in the availability of whole genome sequence data, vast quantities of genetic data are being generated at an ever-increasing pace with the result that programming has become an essential tool for researchers. Most importantly, a thorough understanding of evolutionary principles is essential for making sense of this genetic data. This up-to-date textbook covers all the major components of modern evolutionary genetics, carefully explaining fundamental processes such as mutation, natural selection, genetic drift, and speciation, together with their consequences. In addition to the text, study questions are provided to motivate the reader to think and reflect on the concepts in each chapter. Practical experience is essential when it comes to developing an understanding of how to use genetic data to analyze and address interesting questions in the life sciences and how to interpret results in meaningful ways. Throughout the book, a series of online, computer-based tutorials serves as an introduction to programming and analysis of evolutionary genetic data centered on the R programming language, which stands out as an ideal all-purpose platform to handle and analyze such data. The book and its online materials take full advantage of the authors’ own experience in working in a post-genomic revolution world, and introduce readers to the plethora of molecular and analytical methods that have only recently become available.


2020 ◽  
Vol 6 (1) ◽  
Author(s):  
Diego Forni ◽  
Rachele Cagliani ◽  
Mario Clerici ◽  
Uberto Pozzoli ◽  
Manuela Sironi

Abstract Human betaherpesviruses 6A and 6B (HHV-6A and HHV-6B) are highly prevalent in human populations. The genomes of these viruses can be stably integrated at the telomeres of human chromosomes and be vertically transmitted (inherited chromosomally integrated HHV-6A/HHV-6B, iciHHV-6A/iciHHV-6B). We reconstructed the population structures of HHV-6A and HHV-6B, showing that HHV-6A diverged less than HHV-6B genomes from the projected common ancestral population. Thus, HHV-6B genomes experienced stronger drift, as also supported by calculation of nucleotide diversity and Tajima’s D. Analysis of ancestry proportions indicated that HHV-6A exogenous viruses and iciHHV-6A derived most of their genomes from distinct ancestral sources. Conversely, ancestry proportions were similar in exogenous HHV-6B viruses and iciHHV-6B. In line with previous indications, this suggests the distinct exogenous viral populations that originated iciHHV-6B in subjects with European and Asian ancestry are still causing infections in the corresponding geographic areas. Notably, for both iciHHV-6A and iciHHV-6B, we found that European and American sequences tend to have high proportions of ancestry from viral populations that experienced considerable drift, suggesting that they underwent one or more bottlenecks followed by population expansion. Finally, analysis of HHV-6B exogenous viruses sampled in Japan indicated that proportions of ancestry components of most of these viruses are different from the majority of those sampled in the USA. More generally, we show that, in both viral species, both integrated and exogenous viral genomes have different ancestry components, partially depending on geographic location. It would be extremely important to determine whether such differences account for the diversity of HHV-6A/HHV-6B-associated clinical symptoms and epidemiology. 
Also, the sequencing of additional exogenous and integrated viral genomes will be instrumental to confirm and expand our conclusions, which are based on a relatively small number of genomes, sequenced with variable quality, and with unequal sampling in terms of geographic origin.
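
The abstract above cites nucleotide diversity and Tajima's D as evidence of drift. As a reference point, here is a minimal sketch of both statistics using the standard formulas from Tajima (1989). It assumes equal-length, aligned haploid sequences with no missing data, and it is not the pipeline used in the study; the function names and toy alignment are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_differences(seqs):
    """Average number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    total = sum(sum(a != b for a, b in zip(s, t)) for s, t in pairs)
    return total / len(pairs)


def segregating_sites(seqs):
    """Number of alignment columns with more than one allele."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def tajimas_d(seqs):
    """Tajima's D for a list of aligned sequences (needs at least 1 variant)."""
    n = len(seqs)
    s = segregating_sites(seqs)
    if s == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    pi = mean_pairwise_differences(seqs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    # Difference between the two estimators of theta, scaled by its
    # estimated standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy alignment (hypothetical): three singleton variants among four sequences.
toy_alignment = ["AAAA", "TAAA", "ATAA", "AATA"]
toy_pi = mean_pairwise_differences(toy_alignment)
toy_d = tajimas_d(toy_alignment)
```

An excess of rare variants, as in the toy alignment, pulls D below zero, which is the pattern typically associated with population expansion after a bottleneck.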


2021 ◽  
Author(s):  
Erik Volz ◽  
Swapnil Mishra ◽  
Meera Chand ◽  
Jeffrey C. Barrett ◽  
Robert Johnson ◽  
...  

Abstract The SARS-CoV-2 lineage B.1.1.7, now designated Variant of Concern 202012/01 (VOC) by Public Health England, originated in the UK in late Summer to early Autumn 2020. We examine epidemiological evidence for this VOC having a transmission advantage from several perspectives. First, whole genome sequence data collected from community-based diagnostic testing provides an indication of changing prevalence of different genetic variants through time. Phylodynamic modelling additionally indicates that genetic diversity of this lineage has changed in a manner consistent with exponential growth. Second, we find that changes in VOC frequency inferred from genetic data correspond closely to changes inferred by S-gene target failures (SGTF) in community-based diagnostic PCR testing. Third, we examine growth trends in SGTF and non-SGTF case numbers at local area level across England, and show that the VOC has higher transmissibility than non-VOC lineages, even if the VOC has a different latent period or generation time. Available SGTF data indicate a shift in the age composition of reported cases, with a larger share of under-20-year-olds among reported VOC than non-VOC cases. Fourth, we assess the association of VOC frequency with independent estimates of the overall SARS-CoV-2 reproduction number through time. Finally, we fit a semi-mechanistic model directly to local VOC and non-VOC case incidence to estimate the reproduction numbers over time for each. There is a consensus among all analyses that the VOC has a substantial transmission advantage, with the estimated difference in reproduction numbers between VOC and non-VOC ranging between 0.4 and 0.7, and the ratio of reproduction numbers varying between 1.4 and 1.8. We note that these estimates of transmission advantage apply to a period where high levels of social distancing were in place in England; extrapolation to other transmission contexts therefore requires caution.
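
To make the link between variant frequency and a ratio of reproduction numbers concrete, the sketch below fits a straight line to the log-odds of variant frequency over time and converts the slope to a ratio. It rests on strong simplifying assumptions that are not taken from the study: logistic frequency growth, a fixed generation time shared by both lineages (so the ratio is exp(slope × generation time)), and purely illustrative values for the frequencies and the 6.5-day generation time. The function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    return math.log(p / (1 - p))


def logistic_growth_rate(times, freqs):
    """Least-squares slope of logit(frequency) on time.

    Under logistic growth this slope equals the difference in exponential
    growth rates between the variant and the other lineages.
    """
    ys = [logit(p) for p in freqs]
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(ys) / n
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, ys))
    sxx = sum((t - t_bar) ** 2 for t in times)
    return sxy / sxx


def reproduction_number_ratio(growth_rate_diff, generation_time):
    """Ratio of reproduction numbers assuming a fixed generation time."""
    return math.exp(growth_rate_diff * generation_time)


# Illustrative data only: weekly frequencies generated from a known slope.
days = [0, 7, 14, 21]
true_slope = 0.07  # per day, hypothetical
frequencies = [1 / (1 + math.exp(-(-3 + true_slope * d))) for d in days]

slope = logistic_growth_rate(days, frequencies)
ratio = reproduction_number_ratio(slope, generation_time=6.5)  # assumed value
```

The conversion is sensitive to the assumed generation time, which is one reason the abstract checks that its conclusion holds even if the VOC has a different latent period or generation time.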


2018 ◽  
Author(s):  
Aaron P. Ragsdale ◽  
Claudia Moreau ◽  
Simon Gravel

Abstract Evolutionary, biological, and demographic processes combine to shape the variation observed in populations. Understanding how these processes are expected to influence variation allows us to infer past demographic events and the nature of selection in human populations. Forward models such as the diffusion approximation provide a powerful tool for analyzing the distribution of allele frequencies in contemporary populations due to their computational tractability and model flexibility. Here, we discuss recent computational developments and their application to reconstructing human demographic history and patterns of selection at new mutations. We also reexamine how some classical assumptions that are still commonly used in inference studies fare when applied to modern data. We use whole-genome sequence data for 797 French Canadian individuals to examine the neutrality of synonymous sites. We find that selection can lead to strong biases in the inferred demography, mutation rate, and distributions of fitness effects. We use these distributions of fitness effects together with demographic and phenotype-fitness models to predict the relationship between effect size and allele frequency, and contrast those predictions with commonly used models in statistical genetics. Thus, the simple evolutionary models investigated by Kimura and Ohta still provide important insight into modern genetic research.


2016 ◽  
Vol 106 (6) ◽  
pp. 636-644 ◽  
Author(s):  
Marie-Claude Gagnon ◽  
Theo A. J. van der Lee ◽  
Peter J. M. Bonants ◽  
Donna S. Smith ◽  
Xiang Li ◽  
...  

Synchytrium endobioticum is the fungal agent causing potato wart disease. Because of its severity and persistence, quarantine measures are enforced worldwide to avoid the spread of this disease. Molecular markers exist for species-specific detection of this pathogen, yet markers to study the intraspecific genetic diversity of S. endobioticum were not available. Whole-genome sequence data from Dutch pathotype 1 isolate MB42 of S. endobioticum were mined for perfect microsatellite motifs. Of the 62 selected microsatellites, 21 could be amplified successfully and displayed moderate levels of polymorphism in 22 S. endobioticum isolates from different countries. Nineteen multilocus genotypes were observed, with only three isolates from Canada displaying identical profiles. The majority of isolates from Canada clustered genetically. In contrast, most isolates collected in Europe show no genetic clustering associated with their geographic origin. S. endobioticum isolates with the same pathotype displayed highly variable genotypes and none of the microsatellite markers correlated with a specific pathotype. The markers developed in this study can be used to assess intraspecific genetic diversity of S. endobioticum and allow track and trace of genotypes that will generate a better understanding of the migration and spread of this important fungal pathogen and support management of this disease.


Author(s):  
Amnon Koren ◽  
Dashiell J Massey ◽  
Alexa N Bracci

Abstract
Motivation: Genomic DNA replicates according to a reproducible spatiotemporal program, with some loci replicating early in S phase while others replicate late. Despite being a central cellular process, DNA replication timing studies have been limited in scale due to technical challenges.
Results: We present TIGER (Timing Inferred from Genome Replication), a computational approach for extracting DNA replication timing information from whole genome sequence data obtained from proliferating cell samples. The presence of replicating cells in a biological specimen leads to non-uniform representation of genomic DNA that depends on the timing of replication of different genomic loci. Replication dynamics can hence be observed in genome sequence data by analyzing DNA copy number along chromosomes while accounting for other sources of sequence coverage variation. TIGER is applicable to any species with a contiguous genome assembly and rivals the quality of experimental measurements of DNA replication timing. It provides a straightforward approach for measuring replication timing and can readily be applied at scale.
Availability and Implementation: TIGER is available at https://github.com/TheKorenLab/TIGER.
Supplementary information: Supplementary data are available at Bioinformatics online.
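
The idea in the Results paragraph, reading replication timing from DNA copy number along a chromosome, can be sketched very roughly as follows. This is not TIGER's algorithm. It simply counts read start positions in fixed-size windows and standardizes the counts, and it deliberately ignores the other sources of coverage variation (such as GC content and mappability) that a real analysis must account for. The function name, the 0-based coordinates, and the toy data are assumptions.

```python
import statistics


def windowed_depth_zscores(read_starts, chrom_length, window_size):
    """Standardized read counts in consecutive fixed-size windows.

    read_starts: 0-based start positions of mapped reads on one chromosome.
    Returns one z-score per window; in proliferating cells, higher values
    would be expected at loci that replicate earlier.
    """
    n_windows = -(-chrom_length // window_size)  # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < chrom_length:
            counts[pos // window_size] += 1

    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)
    if sd == 0:
        # Perfectly uniform coverage carries no timing signal.
        return [0.0] * n_windows
    return [(c - mean) / sd for c in counts]


# Toy chromosome (hypothetical): 30 bp, 10 bp windows, six reads.
toy_reads = [0, 1, 2, 10, 11, 25]
toy_profile = windowed_depth_zscores(toy_reads, chrom_length=30, window_size=10)
```

Standardizing to mean 0 and standard deviation 1 puts profiles from samples with different sequencing depths on a common scale.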

