Inferring heterozygosity from ancient and low coverage genomes

2016 ◽  
Author(s):  
Athanasios Kousathanas ◽  
Christoph Leuenberger ◽  
Vivian Link ◽  
Christian Sell ◽  
Joachim Burger ◽  
...  

ABSTRACT: While genetic diversity can be quantified accurately from high coverage sequencing, it is often desirable to obtain such estimates from low coverage data, either to save costs or because of low DNA quality as observed for ancient samples. Here we introduce a method to accurately infer heterozygosity probabilistically from very low coverage sequences of a single individual. The method relaxes the infinite sites assumption of previous methods, does not require a reference sequence and takes into account both variable sequencing errors and potential post-mortem damage. It is thus also applicable to non-model organisms and ancient genomes. Since error rates as reported by sequencing machines are generally distorted and require recalibration, we also introduce a method to accurately infer recalibration parameters in the presence of post-mortem damage. This method also does not require knowledge about the underlying genome sequence, but instead works from haploid data (e.g. from the X-chromosome of mammalian males) and integrates over the unknown genotypes. Using extensive simulations we show that a few Mb of haploid data is sufficient for accurate recalibration even at average coverages as low as 1-3x. At similar coverages, our method also produces very accurate estimates of heterozygosity down to 10⁻⁴ within windows of about 1 Mb. We further illustrate the usefulness of our approach by inferring genome-wide patterns of diversity for several ancient human samples and find that 3,000-5,000 year old samples showed diversity patterns comparable to modern humans. In contrast, two European hunter-gatherer samples exhibited not only considerably lower levels of diversity than modern samples, but also highly distinct distributions of diversity along their genomes. Interestingly, these distributions were also very different between the two samples, supporting earlier conclusions of a highly diverse and structured population in Europe prior to the arrival of farming.
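To make the idea of probabilistic heterozygosity inference concrete, here is a minimal sketch and not the authors' method. It assumes a single known, constant error rate, biallelic sites, no post-mortem damage, and alleles labelled arbitrarily as A and B so that no reference is needed. The function names (`estimate_heterozygosity`, `simulate_sites`) and all parameter values are hypothetical. It estimates the fraction of heterozygous sites with a simple EM over binomial genotype likelihoods.

```python
import math
import random
from collections import Counter


def binom_pmf(k, n, p):
    """Probability of k successes in n Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def estimate_heterozygosity(sites, error_rate, n_iter=100):
    """EM estimate of the heterozygous-site fraction theta.

    sites: iterable of (depth, b_count) pairs, where b_count is the number
    of reads showing one arbitrarily labelled allele B (assumption: biallelic).
    """
    patterns = Counter(sites)  # identical (depth, b_count) pairs share a likelihood
    total = sum(patterns.values())
    liks = []
    for (n, k), count in patterns.items():
        # Homozygous AA or BB, equally likely a priori; minority reads are errors.
        hom = 0.5 * (binom_pmf(k, n, error_rate) + binom_pmf(k, n, 1 - error_rate))
        # Heterozygous: each read shows B with probability 1/2.
        het = binom_pmf(k, n, 0.5)
        liks.append((hom, het, count))
    theta = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability of heterozygosity; M-step: its average.
        theta = sum(c * theta * het / (theta * het + (1 - theta) * hom)
                    for hom, het, c in liks) / total
    return theta


def simulate_sites(n_sites, theta, error_rate, rng, min_depth=2, max_depth=6):
    """Simulate (depth, b_count) pairs under the same simplified model."""
    sites = []
    for _ in range(n_sites):
        n = rng.randint(min_depth, max_depth)
        if rng.random() < theta:
            p_b = 0.5  # heterozygous site
        else:
            p_b = error_rate if rng.random() < 0.5 else 1 - error_rate
        k = sum(rng.random() < p_b for _ in range(n))
        sites.append((n, k))
    return sites


if __name__ == "__main__":
    rng = random.Random(1)
    data = simulate_sites(20000, theta=0.1, error_rate=0.01, rng=rng)
    print(round(estimate_heterozygosity(data, error_rate=0.01), 3))
```

The simulated heterozygosity of 0.1 is far higher than realistic values and is chosen only so that a small example converges quickly. Estimating rates near 10⁻⁴ would require far more sites and the error modelling described in the abstract.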

2018 ◽  
Author(s):  
Susanne Tilk ◽  
Alan Bergland ◽  
Aaron Goodman ◽  
Paul Schmidt ◽  
Dmitri Petrov ◽  
...  

Abstract: Evolve-and-resequence (E+R) experiments leverage next-generation sequencing technology to track the allele frequency dynamics of populations as they evolve. While previous work has shown that adaptive alleles can be detected by comparing frequency trajectories from many replicate populations, this power comes at the expense of high-coverage (>100x) sequencing of many pooled samples, which can be cost-prohibitive. Here, we show that accurate estimates of allele frequencies can be achieved with very shallow sequencing depths (<5x) via inference of known founder haplotypes in small genomic windows. This technique can be used to efficiently estimate frequencies for any number of bi-allelic SNPs in populations of any model organism founded with sequenced homozygous strains. Using both experimentally-pooled and simulated samples of Drosophila melanogaster, we show that haplotype inference can improve allele frequency accuracy by orders of magnitude for up to 50 generations of recombination, and is robust to moderate levels of missing data, as well as different selection regimes. Finally, we show that a simple linear model generated from these simulations can predict the accuracy of haplotype-derived allele frequencies in other model organisms and experimental designs. To make these results broadly accessible for use in E+R experiments, we introduce HAF-pipe, an open-source software tool for calculating haplotype-derived allele frequencies from raw sequencing data. Ultimately, by reducing sequencing costs without sacrificing accuracy, our method facilitates E+R designs with higher replication and resolution, and thereby, increased power to detect adaptive alleles.
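The step from founder haplotype frequencies to SNP allele frequencies can be sketched as follows. This is an illustration only and not HAF-pipe's code or interface. It assumes the haplotype frequencies for a window have already been inferred by some other tool and that every founder is homozygous, so each founder carries allele 0 or 1 at each SNP. The function name and the example values are hypothetical.

```python
def haplotype_derived_allele_frequencies(founder_genotypes, haplotype_freqs):
    """Return the alternate-allele frequency at each SNP in one window.

    founder_genotypes[i][j]: allele (0 or 1) carried by founder i at SNP j.
    haplotype_freqs[i]: estimated frequency of founder i's haplotype in the
    window (assumed to be supplied by a separate inference step).
    """
    if len(founder_genotypes) != len(haplotype_freqs):
        raise ValueError("one frequency is needed per founder")
    if abs(sum(haplotype_freqs) - 1.0) > 1e-6:
        raise ValueError("haplotype frequencies must sum to 1")
    n_snps = len(founder_genotypes[0])
    # Each SNP frequency is the total frequency of founders carrying allele 1.
    return [
        sum(h * g[j] for h, g in zip(haplotype_freqs, founder_genotypes))
        for j in range(n_snps)
    ]


if __name__ == "__main__":
    founders = [
        [0, 1, 1],  # founder 1
        [1, 1, 0],  # founder 2
        [0, 0, 1],  # founder 3
    ]
    print(haplotype_derived_allele_frequencies(founders, [0.5, 0.3, 0.2]))
```

Because a handful of haplotype frequencies determines every SNP in the window, reads from across the whole window contribute to each SNP's estimate, which is one way to see why shallow coverage can suffice.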


Genetics ◽  
2019 ◽  
Vol 212 (3) ◽  
pp. 587-614 ◽  
Author(s):  
Gabriel Renaud ◽  
Kristian Hanghøj ◽  
Thorfinn Sand Korneliussen ◽  
Eske Willerslev ◽  
Ludovic Orlando

Both the total amount and the distribution of heterozygous sites within individual genomes are informative about the genetic diversity of the population they belong to. Detecting true heterozygous sites in ancient genomes is complicated by the generally limited coverage achieved and the presence of post-mortem damage inflating sequencing errors. Additionally, large runs of homozygosity found in the genomes of particularly inbred individuals and of domestic animals can skew estimates of genome-wide heterozygosity rates. Current computational tools aimed at estimating runs of homozygosity and genome-wide heterozygosity levels are generally sensitive to such limitations. Here, we introduce ROHan, a probabilistic method which substantially improves the estimate of heterozygosity rates both genome-wide and for genomic local windows. It combines a local Bayesian model and a Hidden Markov Model at the genome-wide level and can work both on modern and ancient samples. We show that our algorithm outperforms currently available methods for predicting heterozygosity rates for ancient samples. Specifically, ROHan can delineate large runs of homozygosity (at megabase scales) and produce a reliable confidence interval for the genome-wide rate of heterozygosity outside of such regions from modern genomes with a depth of coverage as low as 5–6× and down to 7–8× for ancient samples showing moderate DNA damage. We apply ROHan to a series of modern and ancient genomes previously published and revise available estimates of heterozygosity for humans, chimpanzees and horses.


2018 ◽  
Author(s):  
Roger Ros-Freixedes ◽  
Mara Battagin ◽  
Martin Johnsson ◽  
Gregor Gorjanc ◽  
Alan J Mileham ◽  
...  

Abstract. Background: Inherent sources of error and bias that affect the quality of the sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing there is a need to understand the impact of these errors and bias on resulting genotype calls. Results: We used a dataset of 26 pigs sequenced both at 2x with multiplexing and at 30x without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, a default and desired step for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points. Conclusions: We propose a simple pipeline to correct this bias and we recommend that users of low-coverage sequencing be wary of unexpected biases produced by tools designed for high-coverage sequencing.
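A back-of-the-envelope calculation shows why a read-support threshold can bias low-coverage calls towards the reference. The sketch below is not the authors' pipeline or any specific tool's pruning rule. It assumes a heterozygous site where each read carries the alternative allele with probability 0.5, and a hypothetical threshold of two supporting reads. It computes the probability that the alternative allele survives pruning at a given depth.

```python
import math


def alt_retention_probability(depth, min_support, alt_fraction=0.5):
    """Probability that at least `min_support` of `depth` reads carry the
    alternative allele, i.e. that the allele survives threshold pruning.

    alt_fraction is the per-read chance of sampling the alternative allele
    (0.5 at a heterozygous site, assuming no sampling or alignment bias).
    """
    return sum(
        math.comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_support, depth + 1)
    )


if __name__ == "__main__":
    for depth in (1, 2, 5, 30):
        print(depth, alt_retention_probability(depth, min_support=2))
```

With exactly two reads and a threshold of two, the alternative allele is kept only when both reads carry it, which happens a quarter of the time. At a depth of 30 it is almost always kept, so the same filter is nearly harmless at high coverage.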


2021 ◽  
Author(s):  
Pere Gelabert ◽  
Susanna Sawyer ◽  
Anders Bergstrom ◽  
Thomas C. Collin ◽  
Tengiz Meshvelian ◽  
...  

Archaeological sediments have been shown to preserve ancient DNA, but so far have not yielded genome-scale information of the magnitude of skeletal remains. We retrieved and analysed human and mammalian low-coverage nuclear and high-coverage mitochondrial genomes from Upper Palaeolithic sediments from Satsurblia cave, western Georgia, dated to 25,000 years ago. First, a human female genome with substantial basal Eurasian ancestry, which was an ancestry component of the majority of post-Ice Age people in the Near East, North Africa, and parts of Europe. Second, a wolf genome that is basal to extant Eurasian wolves and dogs and represents a previously unknown, likely extinct, Caucasian lineage that diverged from the ancestors of modern wolves and dogs before these diversified. Third, a bison genome that is basal to present-day populations, suggesting that population structure has been substantially reshaped since the Last Glacial Maximum. Our results provide new insights into the late Pleistocene genetic histories of these three species, and demonstrate that sediment DNA can be used not only for species identification, but also as a source of genome-wide ancestry information and genetic history.


2021 ◽  
Author(s):  
Leo Speidel ◽  
Lara Cassidy ◽  
Robert W. Davies ◽  
Garrett Hellenthal ◽  
Pontus Skoglund ◽  
...  

Abstract: Ancient genomes anchor genealogies in directly observed historical genetic variation, and contextualise ancestral lineages with archaeological insights into their geography and lifestyles. We introduce an extension of the Relate algorithm to incorporate ancient genomes and reconstruct the joint genealogies of 14 previously published high-coverage ancients and 278 present-day individuals of the Simons Genome Diversity Project. As the majority of ancient genomes are of lower coverage and cannot be directly built into genealogies, we additionally present a fast and scalable method, Colate, for inferring coalescence rates between low-coverage genomes without requiring phasing or imputation. Our method leverages sharing patterns of mutations dated using a genealogy to construct a likelihood, which is maximised using an expectation-maximisation algorithm. We apply Colate to 430 ancient human shotgun genomes of >0.5x mean coverage. Using Relate and Colate, we characterise dynamic population structure, such as repeated partial population replacements in Ireland, and gene-flow between early farmer and European hunter-gatherer groups. We further show that the previously reported increase in the TCC/TTC mutation rate, which is strongest in West Eurasians among present-day people, was already widespread across West Eurasia in the Late Glacial Period ~10k - 15k years ago, is strongest in Neolithic and Anatolian farmers, and is remarkably well predicted by the coalescence rates between other genomes and a 10,000-year-old Anatolian individual. This suggests that the driver of this signal originated in ancestors of ancient Anatolia >14k years ago, but was already absent by the Mesolithic and may indicate a genetic link between the Near East and European hunter-gatherer groups in the Late Paleolithic.


Genes ◽  
2019 ◽  
Vol 10 (8) ◽  
pp. 561 ◽  
Author(s):  
Luca Ferretti ◽  
Chandana Tennakoon ◽  
Adrian Silesian ◽  
Graham Freimanis ◽  
Paolo Ribeca

Current high-throughput sequencing technologies can generate sequence data and provide information on the genetic composition of samples at very high coverage. Deep sequencing approaches enable the detection of rare variants in heterogeneous samples, such as viral quasi-species, but also have the undesired effect of amplifying sequencing errors and artefacts. Distinguishing real variants from such noise is not straightforward. Variant callers that can handle pooled samples can be in trouble at extremely high read depths, while at lower depths sensitivity is often sacrificed to specificity. In this paper, we propose SiNPle (Simplified Inference of Novel Polymorphisms from Large coveragE), a fast and effective software for variant calling. SiNPle is based on a simplified Bayesian approach to compute the posterior probability that a variant is not generated by sequencing errors or PCR artefacts. The Bayesian model takes into consideration individual base qualities as well as their distribution, the baseline error rates during both the sequencing and the PCR stage, the prior distribution of variant frequencies and their strandedness. Our approach leads to an approximate but extremely fast computation of posterior probabilities even for very high coverage data, since the expression for the posterior distribution is a simple analytical formula in terms of summary statistics for the variants appearing at each site in the genome. These statistics can be used to filter out putative SNPs and indels according to the required level of sensitivity. We tested SiNPle on several simulated and real-life viral datasets to show that it is faster and more sensitive than existing methods. The source code for SiNPle is freely available to download and compile, or as a Conda/Bioconda package.


2021 ◽  
Vol 4 ◽  
Author(s):  
Izabela Mendes ◽  
Heron Hilário ◽  
Daniel Teixeira ◽  
Daniel Cardoso de Carvalho

Species richness is a metric of biodiversity usually used in fish community assessment for monitoring programs. This metric is often obtained using traditional fisheries methods that rely on capture of target organisms, resulting in underestimation of fish species. DNA metabarcoding has been recognized as a powerful noninvasive alternative tool for fish biomonitoring and management. Despite the increasing popularity of this method for the assessment of aquatic megadiverse ecosystems, its implementation for studying the highly diverse Neotropical ichthyofauna still presents some challenges. One of them is to devise what primer set could reliably amplify the DNA of all fish species from a megadiverse river basin and have enough resolution to identify them. In order to identify and overcome these drawbacks, we have investigated the efficiency of the metabarcoding approach on Neotropical fishes using a mock sample containing genomic DNA of 18 fish species from the Jequitinhonha River basin, Eastern Brazil. We compared three primer sets targeting the 12S rRNA gene: two universal and widely used markers for fish metabarcoding [MiFish (~170bp) and Teleo_1 (~60bp)], and NeoFish (~190bp), recently developed by our research group specifically for the identification of Neotropical fishes (Milan et al., 2020). Two samples amplified using three primers were sequenced in a single multiplexed Illumina MiniSeq run, using normalized and non-normalized pools. Bioinformatic analyses were performed using a DADA2/Phyloseq based pipeline to perform filtering steps and to assign Amplicon Sequence Variants (ASVs). We used a custom 12S reference sequence database that included 190 specimens representing 101 species and 70 genera from the Jequitinhonha and São Francisco river basins. A total of 187 ASVs were recovered: 79, 66 and 42 for NeoFish, MiFish and Teleo_1, respectively. ASVs of unexpected species were identified for both pools (Fig. 1), though each of these ASVs had an abundance of less than 50 copies. In addition, species of the Hoplias and Prochilodus genera could not be identified at the species level, due to identical sequences within each genus, possibly because of the insufficient variation within the 12S region recovered by these primers’ amplicons. Unexpectedly, although a single individual of each species was placed in the pools, more than one ASV was identified for some species, likely caused by PCR biases. Overall, all primer sets displayed similar taxonomic resolution for the DNA pools and recovered all species, except for NeoFish, which could not detect Steindachneridion amblyurum due to an incompatibility in the 3’ of the NeoFish forward primer, and Teleo_1, which could not identify Steindachnerina elegans. These results highlight the need for reliable databases in order to enable the full assignment of ASVs and OTUs to species level, and the importance of calibrating the DNA metabarcoding approach with mock samples to identify weaknesses and pivotal steps prior to its application in large-scale DNA-based biodiversity evaluation, which can help with the complex task of conserving the megadiverse Neotropical ichthyofauna.


2017 ◽  
Author(s):  
Wilfried M. Guiblet ◽  
Marzia A. Cremona ◽  
Monika Cechova ◽  
Robert S. Harris ◽  
Iva Kejnovska ◽  
...  

ABSTRACT: DNA conformation may deviate from the classical B-form in ~13% of the human genome. Non-B DNA regulates many cellular processes; however, its effects on DNA polymerization speed and accuracy have not been investigated genome-wide. Such an inquiry is critical for understanding neurological diseases and cancer genome instability. Here we present the first simultaneous examination of DNA polymerization kinetics and errors in the human genome sequenced with Single-Molecule-Real-Time technology. We show that polymerization speed differs between non-B and B-DNA: it decelerates at G-quadruplexes and fluctuates periodically at disease-causing tandem repeats. Analyzing polymerization kinetics profiles, we predict and validate experimentally non-B DNA formation for a novel motif. We demonstrate that several non-B motifs affect sequencing errors (e.g., G-quadruplexes increase error rates) and that sequencing errors are positively associated with polymerase slowdown. Finally, we show that highly divergent G4 motifs have pronounced polymerization slowdown and high sequencing error rates, suggesting similar mechanisms for sequencing errors and germline mutations.


2018 ◽  
Author(s):  
Xinzhu Zhou ◽  
Celine L. St. Pierre ◽  
Natalia M. Gonzales ◽  
Riyan Cheng ◽  
Apurva Chitre ◽  
...  

Abstract: Replication is considered to be critical for genome-wide association studies (GWAS) in humans, but is not routinely performed in model organisms. We explored replication using an advanced intercross line (AIL), which is the simplest possible multigenerational intercross. We re-genotyped a previously published cohort of LG/J x SM/J AIL mice (F34; n=428) using a denser marker set and also genotyped a novel cohort of AIL mice (F39-43; n=600) for the first time. We identified 110 significant loci in the F34 cohort, 36 of which were new discoveries attributable to the denser marker set; we also identified 27 novel significant loci in the F39-43 cohort. For traits measured in both cohorts (locomotor activity, body weight, and coat color), the genetic correlations were high, although the F39-43 cohort showed systematically lower SNP-heritability estimates. We then attempted to replicate loci identified in either F34 or F39-43 in the other cohort. Albino coat color was robustly replicated; we observed only partial replication of associations for locomotor activity and body weight. Finally, we performed a mega-analysis of locomotor activity and body weight by combining F34 and F39-43 cohorts (n=1,028), which identified four novel loci. The incomplete replication was inconsistent with simulations we performed to estimate our power to replicate. This may reflect: 1) false positive errors in the discovery cohort, 2) environmental or genetic heterogeneity between the two samples, or 3) the systematic overestimation of the effect sizes at significant loci (“Winner’s Curse”). Our results demonstrate that it is difficult to replicate GWAS results even when using similarly sized discovery and replication cohorts drawn from the same population.

