Systematic benchmark of ancient DNA read mapping

Author(s):
Adrien Oliva
Raymond Tobler
Alan Cooper
Bastien Llamas
Yassine Souilmi

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA ‘reads’) against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read mapping software—BWA-aln, BWA-mem, NovoAlign and Bowtie2—and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
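The unbiased NovoAlign results described above relied on an IUPAC reference genome, in which known polymorphic positions are encoded as ambiguity codes so that reads carrying either allele match the reference equally well. A minimal sketch of the idea, assuming biallelic SNPs and a hypothetical helper function (real pipelines build such references with variant-aware tools from population VCFs):

```python
# Map unordered base pairs to their IUPAC ambiguity codes.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("GC"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}

def iupac_reference(ref_seq, snps):
    """Replace each biallelic SNP position (0-based) with its IUPAC code.

    snps: dict mapping position -> (reference_allele, alternative_allele),
    e.g. taken from a reference population panel.
    """
    seq = list(ref_seq)
    for pos, (ref, alt) in snps.items():
        assert seq[pos] == ref, "reference allele mismatch"
        seq[pos] = IUPAC[frozenset(ref + alt)]
    return "".join(seq)

# A known A/G polymorphism at position 1 becomes 'R'.
print(iupac_reference("CATG", {1: ("A", "G")}))  # CRTG
```

Because the ambiguity code matches both alleles, an aligner that scores IUPAC codes (as NovoAlign does) no longer penalises reads carrying the alternative allele at these sites.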

2019
Author(s):
Rui Martiniano
Erik Garrison
Eppie R. Jones
Andrea Manica
Richard Durbin

Abstract Background During the last decade, the analysis of ancient DNA (aDNA) sequence data has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Recently, alternative approaches for read mapping and genetic variation analysis have been developed that replace the linear reference by a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of the variation graph software vg to avoid reference bias for ancient DNA and compare our approach to existing methods. Results We used vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants, and compared these with the same data aligned with bwa to the human linear reference genome. We show that use of vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and more sensitive variant detection in comparison with bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias, but can have lower sensitivity than vg, particularly for indels. Conclusions Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analysing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed.
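One way to see the "balanced allelic representation" this abstract describes is to measure the fraction of mapped reads carrying the reference allele at known heterozygous sites: in the absence of reference bias this should sit near 0.5. A minimal illustration with a hypothetical function name and toy read counts:

```python
def allele_balance(counts):
    """Fraction of reads carrying the reference allele across het sites.

    counts: list of (ref_reads, alt_reads) tuples, one per heterozygous site.
    A value near 0.5 indicates no reference bias; values above 0.5 indicate
    that reads carrying the alternative allele are being lost.
    """
    ref = sum(r for r, _ in counts)
    alt = sum(a for _, a in counts)
    return ref / (ref + alt)

# Unbiased mapping: roughly equal ref/alt coverage across sites.
print(allele_balance([(10, 10), (7, 8), (12, 11)]))
# Reference-biased mapping: reads with the alt allele fail to map.
print(allele_balance([(10, 6), (7, 4), (12, 7)]))
```

In the paper's framing, graph alignment keeps this statistic close to 0.5, whereas naive linear-reference alignment pushes it upward.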


2020
Vol 21 (1)
Author(s):
Rui Martiniano
Erik Garrison
Eppie R. Jones
Andrea Manica
Richard Durbin

Abstract Background During the last decade, the analysis of ancient DNA (aDNA) sequence data has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Alternative approaches have been developed to replace the linear reference with a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of the variation graph software vg to avoid reference bias for aDNA and compare with existing methods. Results We used vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants and compared with the same data aligned with bwa to the human linear reference genome. Using vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and more sensitive variant detection in comparison with bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias but can have lower sensitivity than vg, particularly for indels. Conclusions Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analyzing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed.


2018
Author(s):
Torsten Günther
Carl Nettelblad

Abstract High-quality reference genomes are an important resource in genomic research projects. A consequence of mapping sequencing reads against such a reference is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can affect downstream population genomic analyses when heterozygous sites are falsely considered homozygous for the reference allele.

In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverage, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, reducing the number of accepted mismatches and increasing the probability of multiple matching sites in the genome. These ancient-DNA-specific properties potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudohaploid data, i.e. they randomly sample only one sequencing read per site.

We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans, with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and to develop strategies that mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias, which help to reduce its impact substantially.
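The pseudohaploid calling this abstract refers to, drawing a single sequencing read at random per site, can be sketched as follows (hypothetical function and data names; real pipelines operate on BAM pileups). Because only one read is sampled, any bias in which reads mapped propagates directly into the genotype calls:

```python
import random

def pseudohaploidize(pileup, seed=0):
    """Randomly draw one read's allele per site (pseudohaploid calling).

    pileup: dict mapping site -> list of alleles observed in mapped reads.
    Heterozygous sites are collapsed to a single random allele, so if
    alt-carrying reads were lost during mapping (reference bias), the
    sampled calls are skewed toward the reference allele.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    return {site: rng.choice(alleles) for site, alleles in pileup.items()}

calls = pseudohaploidize({"chr1:1000": ["A", "G", "A"], "chr1:2000": ["T"]})
print(calls)
```

At "chr1:1000" the call is a coin flip weighted by the surviving reads, which is exactly why the paper's post-mapping filters matter before this sampling step.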


2017
Author(s):
Benedict Paten
Adam M. Novak
Jordan M. Eizenga
Erik Garrison

Abstract The human reference genome is part of the foundation of modern human biology, and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures—which we collectively refer to as genome graphs—and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
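The core idea of a genome graph can be illustrated with a toy "bubble": a variant site becomes two alternative paths, so reads carrying either allele spell out a sequence that exists in the graph. A minimal sketch using a hypothetical dict-based representation (not the data structures used by vg or other production tools):

```python
def enumerate_paths(graph, node, prefix=""):
    """Enumerate all sequences spelled by paths through a tiny acyclic
    genome graph, given as dict: node -> (sequence, successor nodes).

    Each known variant adds an alternative path, so reads carrying
    either allele can match the graph exactly -- the core mechanism by
    which genome graphs avoid reference bias.
    """
    seq, successors = graph[node]
    if not successors:
        return [prefix + seq]
    paths = []
    for nxt in successors:
        paths.extend(enumerate_paths(graph, nxt, prefix + seq))
    return paths

# A SNP bubble: "AC", then either G (ref) or T (alt), then "GA".
bubble = {
    "start": ("AC", ["ref", "alt"]),
    "ref": ("G", ["end"]),
    "alt": ("T", ["end"]),
    "end": ("GA", []),
}
print(sorted(enumerate_paths(bubble, "start")))  # ['ACGGA', 'ACTGA']
```

A linear reference contains only "ACGGA"; a read spelling "ACTGA" pays a mismatch penalty against it, but matches the graph perfectly.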


2018
Author(s):
Cesare de Filippo
Matthias Meyer
Kay Prüfer

Abstract The study of ancient DNA is hampered by degradation, resulting in short DNA fragments. Advances in laboratory methods have made it possible to retrieve these short DNA fragments, thereby improving access to DNA preserved in highly degraded, ancient material. However, such material contains large amounts of microbial contamination in addition to DNA fragments from the ancient organism. The resulting mixture of sequences constitutes a challenge for computational analysis, since microbial sequences are hard to distinguish from the ancient sequences of interest, especially when they are short. Here, we develop a method to quantify spurious alignments based on the presence or absence of rare variants. We find that spurious alignments are enriched for mismatches and insertion/deletion differences and lack substitution patterns typical of ancient DNA. The impact of spurious alignments can be reduced by filtering on these features and by imposing a sample-specific minimum length cutoff. We apply this approach to sequences from the ~430,000-year-old Sima de los Huesos hominin remains, which contain particularly short DNA fragments, and increase the amount of usable sequence data by 17–150%. This allows us to place a third specimen from the site on the Neandertal lineage. Our method maximizes the sequence data amenable to genetic analysis from highly degraded ancient material and avoids pitfalls associated with the analysis of ultra-short DNA sequences.
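The filtering this abstract proposes, a sample-specific minimum length cutoff combined with filters on the features enriched in spurious alignments, can be sketched as follows (hypothetical function, thresholds and tuple layout; the actual method derives its cutoffs from rare-variant evidence rather than fixed values):

```python
def filter_alignments(alignments, min_len, max_mismatches=1, allow_indels=False):
    """Keep alignments that pass a sample-specific minimum length cutoff
    plus simple mismatch/indel filters -- the features the paper reports
    as enriched among spurious (mostly microbial) alignments.

    alignments: iterable of (length, n_mismatches, has_indel) tuples.
    """
    return [
        a for a in alignments
        if a[0] >= min_len            # sample-specific length cutoff
        and a[1] <= max_mismatches    # spurious hits carry more mismatches
        and (allow_indels or not a[2])  # and more indel differences
    ]

reads = [(35, 0, False), (28, 0, False), (40, 3, False), (36, 1, True)]
print(filter_alignments(reads, min_len=30))  # [(35, 0, False)]
```

Only the first read survives: the second is too short, the third has too many mismatches, and the fourth contains an indel.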


2019
Vol 20 (S9)
Author(s):
Giovanni Spirito
Damiano Mangoni
Remo Sanges
Stefano Gustincich

Abstract Background Transposable elements (TEs) are DNA sequences able to mobilize themselves and to increase their copy number in the host genome. In the past, they were considered mainly selfish DNA without evident functions. Nevertheless, they are currently believed to have been extensively involved in the evolution of primate genomes, especially from a regulatory perspective. Due to their recent activity they are also one of the primary sources of structural variants (SVs) in the human genome. By taking advantage of sequencing technologies and bioinformatics tools, recent surveys uncovered specific TE structural variants (TEVs) that gave rise to polymorphisms in human populations. When combined with RNA-seq data, this information provides the opportunity to study the potential impact of TEs on gene expression in humans. Results In this work, we assessed the effects of the presence of specific TEs in cis on the expression of flanking genes by producing associations between polymorphic TEs and flanking gene expression levels in human lymphoblastoid cell lines. Using public data from the 1000 Genomes Project and the Geuvadis consortium, we exploited an expression quantitative trait loci (eQTL) approach integrated with additional bioinformatics data mining analyses. We uncovered human loci enriched for common, less common and rare TEVs and identified 323 significant TEV-cis-eQTL associations. SINE-R/VNTR/Alus (SVAs) emerged as the TE class with the strongest effects on gene expression. We also unveiled differential functional enrichments on genes associated with TEVs, genes associated with TEV-cis-eQTLs and genes associated with the genomic regions most enriched in TEV-cis-eQTLs, highlighting, at multiple levels, the impact of TEVs on the host genome. Finally, we also identified polymorphic TEs putatively embedded in transcriptional units, proposing a novel mechanism by which TEVs may mediate individual-specific traits. Conclusion We contributed to unveiling the effect of polymorphic TEs on transcription in lymphoblastoid cell lines.
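At its core, a cis-eQTL association of the kind used here amounts to regressing flanking-gene expression on TEV copy number. A bare-bones sketch with hypothetical names and toy data (real analyses add covariates, normalisation and multiple-testing control):

```python
def tev_eqtl_beta(tev_dosages, expression):
    """Least-squares effect size (beta) of a polymorphic TE insertion on
    flanking-gene expression: the simple-regression kernel of a
    cis-eQTL test.

    tev_dosages: 0/1/2 insertion copies per individual.
    expression: normalised expression of the flanking gene, same order.
    """
    n = len(tev_dosages)
    mg = sum(tev_dosages) / n
    me = sum(expression) / n
    num = sum((g - mg) * (e - me) for g, e in zip(tev_dosages, expression))
    den = sum((g - mg) ** 2 for g in tev_dosages)
    return num / den

# Expression rises with TEV copy number -> positive beta.
print(tev_eqtl_beta([0, 0, 1, 1, 2, 2], [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]))
```

A positive beta means individuals carrying the TE insertion express the flanking gene more highly; significance testing and genome-wide correction are left out of this sketch.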


2012
Vol 30 (2)
pp. 253-262
Author(s):
Martyna Molak
Eline D. Lorenzen
Beth Shapiro
Simon Y.W. Ho

Abstract In recent years, ancient DNA has increasingly been used for estimating molecular timescales, particularly in studies of substitution rates and demographic histories. Molecular clocks can be calibrated using temporal information from ancient DNA sequences. This information comes from the ages of the ancient samples, which can be estimated by radiocarbon dating the source material or by dating the layers in which the material was deposited. Both methods involve sources of uncertainty. The performance of Bayesian phylogenetic inference depends on the information content of the data set, which includes variation in the DNA sequences and the structure of the sample ages. Various sources of estimation error can reduce our ability to estimate rates and timescales accurately and precisely. We investigated the impact of sample-dating uncertainties on the estimation of evolutionary timescale parameters using the software BEAST. Our analyses involved 11 published data sets and focused on estimates of substitution rate and root age. We show that, provided that samples have been accurately dated and have a broad temporal span, it might be unnecessary to account for sample-dating uncertainty in Bayesian phylogenetic analyses of ancient DNA. We also investigated the sample size and temporal span of the ancient DNA sequences needed to estimate phylogenetic timescales reliably. Our results show that the range of sample ages plays a crucial role in determining the quality of the results but that accurate and precise phylogenetic estimates of timescales can be made even with only a few ancient sequences. These findings have important practical consequences for studies of molecular rates, timescales, and population dynamics.
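The calibration principle here, that the ages of dated samples inform the substitution rate, can be illustrated with a toy root-to-tip regression: under a strict clock, samples further back in time have accumulated less divergence from the root, and the negated slope of divergence against age estimates the rate. This is only a crude stand-in for the Bayesian machinery in BEAST, with a hypothetical function name and invented numbers:

```python
def rate_from_tip_dates(ages_bp, divergences):
    """Crude substitution-rate estimate from dated ('heterochronous') tips:
    the negated least-squares slope of root-to-tip divergence against
    sample age in years before present (older samples have accumulated
    less divergence). A toy stand-in for Bayesian tip dating in BEAST.
    """
    n = len(ages_bp)
    mx = sum(ages_bp) / n
    my = sum(divergences) / n
    num = sum((x - mx) * (y - my) for x, y in zip(ages_bp, divergences))
    den = sum((x - mx) ** 2 for x in ages_bp)
    return -num / den  # subs/site/year

# Three dated samples spanning 10,000 years of the paper's "temporal span".
print(rate_from_tip_dates([0, 5000, 10000], [0.010, 0.0075, 0.005]))
```

The example also hints at the paper's main finding: the broader the spread of `ages_bp`, the better conditioned this slope is, whereas samples of nearly identical age leave the rate essentially unidentifiable.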


2021
Author(s):
Adrien Oliva
Raymond Tobler
Bastien Llamas
Yassine Souilmi

Xu and colleagues (Xu et al., 2021) recently suggested a new parameterisation of BWA-mem (Li, 2013) as an alternative to the current standard BWA-aln (Li and Durbin, 2009) for processing ancient DNA sequencing data. The authors tested several combinations of the -k and -r parameters to optimise BWA-mem performance with degraded and contaminated ancient DNA samples. They report that using BWA-mem with -k 19 -r 2.5 results in a mapping efficiency comparable to BWA-aln with -l 1024 -n 0.03 (i.e. a derivative of the standard parameters used in ancient DNA studies; Schubert et al., 2012), while achieving significantly faster run times. We recently performed a systematic benchmark of four read mapping software packages (i.e. BWA-aln, BWA-mem, NovoAlign (http://www.novocraft.com/products/novoalign) and Bowtie2 (Langmead and Salzberg, 2012)) for ancient DNA sequencing data and quantified their precision, accuracy, specificity, and impact on reference bias (Oliva et al., 2021). Notably, while multiple parameterisations were tested for BWA-aln, NovoAlign, and Bowtie2, we only tested BWA-mem with default parameters. Here, we use the alignment performance metrics from Oliva et al. to directly compare the recommended BWA-mem parameterisation reported in Xu et al. with the best performing alignment methods determined in the Oliva et al. benchmarks, and we make recommendations based on the results.


2021
Author(s):
Audald Lloret-Villas
Meenu Bhati
Naveen Kumar Kadri
Ruedi Fries
Hubert Pausch

Abstract Background Reference-guided read alignment and variant genotyping are prone to reference allele bias, particularly for samples that are greatly divergent from the reference genome. A Hereford-based assembly is the widely accepted bovine reference genome. Haplotype-resolved genomes that exceed the current bovine reference genome in quality and continuity have been assembled for different breeds of cattle. Using whole genome sequencing data of 161 Brown Swiss cattle, we compared the accuracy of read mapping and sequence variant genotyping as well as downstream genomic analyses between the bovine reference genome (ARS-UCD1.2) and a highly continuous Angus-based assembly (UOA_Angus_1). Results Read mapping accuracy did not differ notably between the ARS-UCD1.2 and UOA_Angus_1 assemblies. We discovered 22,744,517 and 22,559,675 high-quality variants from ARS-UCD1.2 and UOA_Angus_1, respectively. The concordance between sequence- and array-called genotypes was high and the number of variants deviating from Hardy-Weinberg proportions was low at segregating sites for both assemblies. More artefactual INDELs were genotyped from UOA_Angus_1 than ARS-UCD1.2 alignments. Using the composite likelihood ratio test, we detected 40 and 33 signatures of selection from ARS-UCD1.2 and UOA_Angus_1, respectively, but the overlap between both assemblies was low. Using the 161 sequenced Brown Swiss cattle as a reference panel, we imputed sequence variant genotypes into a mapping cohort of 30,499 cattle that had microarray-derived genotypes. The accuracy of imputation (Beagle R2) was very high (0.87) for both assemblies. Genome-wide association studies between imputed sequence variant genotypes and six dairy traits as well as stature produced almost identical results from both assemblies. Conclusions The ARS-UCD1.2 and UOA_Angus_1 assemblies are suitable for reference-guided genome analyses in Brown Swiss cattle. Although differences in read mapping and genotyping accuracy between both assemblies are negligible, the choice of the reference genome has a large impact on detecting signatures of selection using the composite likelihood ratio test. We developed a workflow that can be adapted and reused to compare the impact of reference genomes on genome analyses in various breeds, populations and species.
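The imputation accuracy reported above (Beagle R2) is, in essence, a squared correlation between imputed allele dosages and true genotypes. A simplified per-variant illustration with a hypothetical function name (not Beagle's exact estimator, which works from genotype probabilities):

```python
def imputation_r2(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed allele
    dosages -- the same flavour of accuracy metric as Beagle's R2,
    shown here as a simplified per-variant computation.
    """
    n = len(true_dosages)
    mt = sum(true_dosages) / n
    mi = sum(imputed_dosages) / n
    cov = sum((t - mt) * (i - mi) for t, i in zip(true_dosages, imputed_dosages))
    vt = sum((t - mt) ** 2 for t in true_dosages)
    vi = sum((i - mi) ** 2 for i in imputed_dosages)
    return cov * cov / (vt * vi)

# True genotypes (0/1/2 alt-allele copies) vs. imputed fractional dosages.
print(imputation_r2([0, 1, 2, 1, 0], [0.1, 0.9, 1.8, 1.2, 0.2]))
```

Values near 1 (such as the 0.87 reported for both assemblies) mean the imputed dosages track the true genotypes almost perfectly.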


2020
Author(s):
Rui Martiniano
Bianca De Sanctis
Pille Hallast
Richard Durbin

Abstract During the last decade, large volumes of ancient DNA (aDNA) data have been generated as part of whole-genome shotgun and target capture sequencing studies. This includes sequences from non-recombining loci such as the mitochondrial or Y chromosomes. However, given the highly degraded nature of aDNA data, post-mortem deamination and often low genomic coverage, combining ancient and modern samples for phylogenetic analyses remains difficult. Without care, these factors can lead to incorrect placement.

For the Y chromosome, current standard methods focus on curated markers, but these contain only a subset of the total variation. Examining all polymorphic markers is particularly important for low-coverage aDNA data because it substantially increases the number of overlapping sites between present-day and ancient individuals, which may lead to higher-resolution phylogenetic placement. We provide an automated workflow for jointly analysing ancient and present-day sequence data in a phylogenetic context. For each ancient sample, we effectively evaluate the number of ancestral and derived alleles present on each branch and use this information to place ancient lineages at their most likely position in the phylogeny. We provide both a parsimony approach and a highly optimised likelihood-based approach that assigns a posterior probability to each branch.

To illustrate the application of this method, we have compiled and make available the largest public Y-chromosomal dataset to date (2,014 samples), which we used as a reference for phylogenetic placement. We process publicly available African ancient DNA Y-chromosome sequences and examine how patterns of Y-chromosomal diversity change across time, as well as the relationship between ancient and present-day lineages. The same software can be used to place samples with large amounts of missing data into other large non-recombining phylogenies such as the mitochondrial tree.
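The parsimony placement described above, evaluating ancestral and derived alleles along each candidate branch, can be sketched as follows (hypothetical toy data structures and scoring; the actual workflow also offers a likelihood mode that assigns posterior probabilities):

```python
def place_sample(branch_markers, observed):
    """Parsimony-style placement of an ancient sample on a haplogroup tree.

    branch_markers: dict branch -> {site: derived_allele} defining markers.
    observed: dict site -> allele seen in the ancient sample; sites with
              no coverage are simply absent, as in low-coverage aDNA.

    For each branch, count supporting derived alleles minus conflicting
    (covered but non-derived) observations, and return the best branch.
    """
    def score(markers):
        derived = sum(1 for s, a in markers.items() if observed.get(s) == a)
        conflicting = sum(1 for s, a in markers.items()
                          if s in observed and observed[s] != a)
        return derived - conflicting

    return max(branch_markers, key=lambda b: score(branch_markers[b]))

# Toy tree with two candidate branches and a sparse ancient sample.
tree = {
    "R1a": {1: "T", 2: "G", 3: "A"},
    "E1b": {4: "C", 5: "T", 6: "G"},
}
print(place_sample(tree, {1: "T", 3: "A", 5: "C"}))  # R1a
```

Using all polymorphic markers rather than a curated subset simply enlarges each branch's marker dict, which is why overlap with low-coverage samples, and hence placement resolution, improves.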

