reference bias
Recently Published Documents

Total documents: 44 (five years: 20)
H-index: 9 (five years: 3)

2021 ◽  
Author(s):  
Kristiina Ausmees ◽  
Federico Sanchez-Quinto ◽  
Mattias Jakobsson ◽  
Carl Nettelblad

With capabilities of sequencing ancient DNA to high coverage often limited by sample quality or cost, imputation of missing genotypes presents a possibility to increase power of inference as well as cost-effectiveness for the analysis of ancient data. However, the high degree of uncertainty often associated with ancient DNA poses several methodological challenges, and performance of imputation methods in this context has not been fully explored. To gain further insights, we performed a systematic evaluation of imputation of ancient data using Beagle 4.0 and reference data from phase 3 of the 1000 Genomes project, investigating the effects of coverage, phased reference and study sample size. Making use of five ancient samples with high-coverage data available, we evaluated imputed data with respect to accuracy, reference bias and genetic affinities as captured by PCA. We obtained genotype concordance levels of over 99% for data with 1x coverage, and similar levels of accuracy and reference bias at levels as low as 0.75x. Our findings suggest that using imputed data can be a realistic option for various population genetic analyses even for data in coverage ranges below 1x. We also show that a large and varied phased reference set as well as the inclusion of low- to moderate-coverage ancient samples can increase imputation performance, particularly for rare alleles. In-depth analysis of imputed data with respect to genetic variants and allele frequencies gave further insight into the nature of errors arising during imputation, and can provide practical guidelines for post-processing and validation prior to downstream analysis.
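The accuracy and reference-bias metrics used in this evaluation can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation pipeline: the genotype encoding (0/1/2 counts of non-reference alleles per site) and the example arrays are assumptions, and treating "truly heterozygous sites imputed as homozygous reference" as a bias proxy is one simple choice among several.

```python
# Minimal sketch (hypothetical data layout): genotypes encoded as the
# number of non-reference alleles (0, 1, 2) at each evaluated site.

def concordance(truth, imputed):
    """Fraction of sites where the imputed genotype matches the truth."""
    matches = sum(t == i for t, i in zip(truth, imputed))
    return matches / len(truth)

def het_reference_bias(truth, imputed):
    """Among truly heterozygous sites (truth == 1), the fraction imputed
    as homozygous reference (0) -- one simple proxy for reference bias."""
    het_sites = [(t, i) for t, i in zip(truth, imputed) if t == 1]
    if not het_sites:
        return 0.0
    return sum(i == 0 for _, i in het_sites) / len(het_sites)

truth   = [0, 1, 2, 1, 0, 1, 2, 0]   # high-coverage "truth" genotypes
imputed = [0, 0, 2, 1, 0, 1, 2, 0]   # genotypes imputed from downsampled data
print(concordance(truth, imputed))         # 7/8 = 0.875
print(het_reference_bias(truth, imputed))  # 1 of 3 het sites -> 0.333...
```

In practice both metrics would be computed from VCFs over millions of sites and stratified by allele frequency, as the abstract's discussion of rare alleles suggests.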


2021 ◽  
Author(s):  
Ivar Grytten ◽  
Knut D. Rand ◽  
Geir K. Sandve

Abstract: One of the core applications of high-throughput sequencing is the characterization of individual genetic variation. Traditionally, variants have been inferred by comparing sequenced reads to a reference genome. Genotyping methods have recently emerged that instead infer the variants of an individual based on variation present in population-scale repositories like the 1000 Genomes Project. However, commonly used genotyping methods are slow, since they still require mapping of reads to a reference genome. Also, since traditional reference genomes do not include genetic variation, traditional genotypers suffer from reference bias and poor accuracy in variation-rich regions where reads cannot be mapped accurately.

We here present KAGE, a genotyper for SNPs and short indels that is inspired by recent developments in graph-based genome representations and alignment-free genotyping. We propose two novel ideas to improve both speed and accuracy: we (1) use known genotypes from thousands of individuals in a Bayesian model to predict genotypes, and (2) propose a computationally efficient method for leveraging correlation between variants.

We show through experiments on real sequencing data that KAGE is both faster and more accurate than other alignment-free genotypers. KAGE is able to genotype a new sample (15x coverage) in less than half an hour on a consumer laptop, more than 10 times faster than the fastest existing methods, making it ideal for clinical settings or for genotyping large numbers of individuals at low computational cost.
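The first idea, combining genotype frequencies from a reference panel with read support in a Bayesian model, can be illustrated with a toy calculation. This is not KAGE's actual model: the Poisson read-support likelihood, the add-one smoothing, the expected-count values, and the panel itself are all illustrative assumptions.

```python
import math

# Toy Bayesian genotype prediction for one biallelic variant (not KAGE's
# actual model): prior from a reference panel, Poisson likelihood on the
# number of reads supporting the alternate allele.

def genotype_posterior(panel_genotypes, alt_count, coverage=15):
    """Posterior over genotypes 0/0, 0/1, 1/1."""
    genotypes = ["0/0", "0/1", "1/1"]
    n = len(panel_genotypes)
    # Prior: genotype frequencies in the panel, with add-one smoothing.
    prior = {g: (panel_genotypes.count(g) + 1) / (n + 3) for g in genotypes}
    # Expected alt-supporting reads under each genotype (0.1 stands in
    # for a small sequencing-error rate -- an assumed value).
    expected_alt = {"0/0": 0.1, "0/1": coverage / 2, "1/1": coverage}
    def poisson(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)
    unnorm = {g: prior[g] * poisson(alt_count, expected_alt[g]) for g in genotypes}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

panel = ["0/0"] * 900 + ["0/1"] * 90 + ["1/1"] * 10   # hypothetical panel
post = genotype_posterior(panel, alt_count=7)
print(max(post, key=post.get))  # "0/1": 7 alt reads at 15x supports a het call
```

The panel prior matters most when read support is ambiguous; KAGE's second idea, exploiting correlation between nearby variants, effectively sharpens this prior further.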


2021 ◽  
Vol 1 (1) ◽  
Author(s):  
Linda Armbrecht ◽  
Raphael Eisenhofer ◽  
José Utge ◽  
Elizabeth C. Sibert ◽  
Fabio Rocha ◽  
...  

Abstract: Sedimentary ancient DNA (sedaDNA) analyses are increasingly used to reconstruct marine ecosystems. The majority of marine sedaDNA studies use a metabarcoding approach (extraction and analysis of specific DNA fragments of a defined length), targeting short taxonomic marker genes. Promising examples are 18S-V9 rRNA (~121–130 base pairs, bp) and diat-rbcL (76 bp), targeting eukaryotes and diatoms, respectively. However, it remains unknown how 18S-V9- and diat-rbcL-derived compositional profiles compare to metagenomic shotgun data, the preferred method for ancient DNA analyses because amplification biases are minimised. We extracted DNA from five Santa Barbara Basin sediment samples (up to ~11 000 years old) and applied both a metabarcoding (18S-V9 rRNA, diat-rbcL) and a metagenomic shotgun approach to (i) compare eukaryote, and especially diatom, composition, and (ii) assess sequence-length- and database-related biases. Eukaryote composition differed considerably between shotgun and metabarcoding data, which was related to differences in read lengths (~112 and ~161 bp, respectively) and to overamplification of short reads in the metabarcoding data. Diatom composition was influenced by reference bias that was exacerbated in the metabarcoding data and characterised by increased representation of Chaetoceros, Thalassiosira and Pseudo-nitzschia. Our results are relevant to sedaDNA studies aiming to accurately characterise paleo-ecosystems from either metabarcoding or metagenomic data.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Rachel M. Colquhoun ◽  
Michael B. Hall ◽  
Leandro Lima ◽  
Leah W. Roberts ◽  
Kerri M. Malone ◽  
...  

Abstract: We present pandora, a novel pan-genome graph structure and algorithms for identifying variants across the full bacterial pan-genome. Because much bacterial adaptability hinges on the accessory genome, methods that analyze SNPs in just the core genome have significant limitations. Pandora approximates a sequenced genome as a recombinant of reference sequences, detects novel variation, and pan-genotypes multiple samples. Using a reference graph of 578 Escherichia coli genomes, we compare 20 diverse isolates. Pandora recovers more rare SNPs than single-reference-based tools, performs significantly better than picking the closest RefSeq reference, and provides a stable framework for analyzing diverse samples without reference bias.


2021 ◽  
Author(s):  
Adrien Oliva ◽  
Raymond Tobler ◽  
Bastien Llamas ◽  
Yassine Souilmi

Xu and colleagues (Xu et al., 2021) recently suggested a new parameterisation of BWA-mem (Li, 2013) as an alternative to the current standard BWA-aln (Li and Durbin, 2009) for processing ancient DNA sequencing data. The authors tested several combinations of the -k and -r parameters to optimise BWA-mem performance on degraded and contaminated ancient DNA samples. They report that using BWA-mem with -k 19 -r 2.5 results in a mapping efficiency comparable to BWA-aln with -l 1024 -n 0.03 (a variation of the standard parameters used in ancient DNA studies; Schubert et al., 2012), while achieving significantly faster run times. We recently performed a systematic benchmark of four read mapping tools (BWA-aln, BWA-mem, NovoAlign (http://www.novocraft.com/products/novoalign), and Bowtie2 (Langmead and Salzberg, 2012)) for ancient DNA sequencing data and quantified their precision, accuracy, specificity, and impact on reference bias (Oliva et al., 2021). Notably, while multiple parameterisations were tested for BWA-aln, NovoAlign, and Bowtie2, we tested BWA-mem only with default parameters. Here, we use the alignment performance metrics from Oliva et al. to directly compare the BWA-mem parameterisation recommended by Xu et al. against the best-performing alignment methods from the Oliva et al. benchmarks, and we make recommendations based on the results.


Author(s):  
Adrien Oliva ◽  
Raymond Tobler ◽  
Alan Cooper ◽  
Bastien Llamas ◽  
Yassine Souilmi

Abstract: The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA ‘reads’) against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have remained effectively unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four read mapping programs (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias on downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
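The IUPAC reference strategy mentioned above replaces known polymorphic positions with standard nucleotide ambiguity codes, so that a read carrying either allele is not penalised at that position during mapping. A minimal sketch of the masking step follows; the ambiguity codes are the standard IUPAC ones, but the example sequence, variant positions, and function name are hypothetical.

```python
# Sketch: mask known biallelic SNP positions in a reference sequence with
# IUPAC ambiguity codes. Variant positions and alleles are hypothetical.

IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("GC"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def iupac_mask(reference, snps):
    """snps maps 0-based position -> (ref_allele, alt_allele)."""
    seq = list(reference)
    for pos, (ref, alt) in snps.items():
        assert seq[pos] == ref, f"reference mismatch at position {pos}"
        seq[pos] = IUPAC[frozenset(ref + alt)]
    return "".join(seq)

# C/T at position 1 becomes Y; A/G at position 4 becomes R.
print(iupac_mask("ACGTACGT", {1: ("C", "T"), 4: ("A", "G")}))  # "AYGTRCGT"
```

A real pipeline would apply this per chromosome from a population VCF before indexing the masked FASTA for the mapper, which is where the dependence on available reference population data comes from.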


2021 ◽  
Author(s):  
Jonas A. Sibbesen ◽  
Jordan M. Eizenga ◽  
Adam M. Novak ◽  
Jouni Sirén ◽  
Xian Chang ◽  
...  

Abstract: Pangenomics is emerging as a powerful computational paradigm in bioinformatics. This field uses population-level genome reference structures, typically consisting of a sequence graph, to mitigate reference bias and facilitate analyses that were challenging with previous reference-based methods. In this work, we extend these methods into transcriptomics to analyze sequencing data using the pantranscriptome: a population-level transcriptomic reference. Our novel toolchain can construct spliced pangenome graphs, map RNA-seq data to these graphs, and perform haplotype-aware expression quantification of transcripts in a pantranscriptome. This workflow improves accuracy over state-of-the-art RNA-seq mapping methods, and it can efficiently quantify haplotype-specific transcript expression without needing to characterize a sample’s haplotypes beforehand.


2021 ◽  
Author(s):  
Chirag Jain ◽  
Neda Tavakoli ◽  
Srinivas Aluru

Abstract
Motivation: Variation graph representations are projected to either replace or supplement conventional single-genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants now exist for many species, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping.
Results: In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants (SNPs, indels) and on whether the goal is to minimize the number of positions at which variants are listed or the total number of variants listed. We classify the computational complexity of these problems and provide efficient algorithms, along with their software implementation, where feasible. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short- and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of indel structural variants can be safely excluded from the human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis.
Implementation: https://github.com/at-cg/
Contact: [email protected], [email protected], [email protected]
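The SNP-only flavour of this selection problem has an intuitive greedy reading: drop as many SNPs as possible from the graph while ensuring that any window of length α contains at most δ dropped variants, so a read of length α spanning the dropped sites can still map with at most δ differences. The sketch below is a toy simplification of that framework, not the paper's algorithms; positions and parameters are hypothetical.

```python
# Toy sketch of SNP-only variant selection: greedily drop SNPs so long as
# no window of length alpha ends up with more than delta dropped SNPs.

def droppable_snps(positions, alpha, delta):
    """Greedy left-to-right scan over sorted SNP positions: drop a SNP
    unless doing so would put more than delta dropped SNPs inside some
    window of length alpha (two positions share a window iff their
    distance is < alpha)."""
    dropped = []
    for pos in sorted(positions):
        recent = [p for p in dropped if pos - p < alpha]
        if len(recent) < delta:
            dropped.append(pos)
    return dropped

snps = [10, 15, 30, 120, 125, 500]
print(droppable_snps(snps, alpha=100, delta=2))  # [10, 15, 120, 125, 500]
```

Position 30 must be retained because positions 10 and 15 already use up the δ = 2 budget within its window; with long-read-like parameters (large α, large δ), nearly all SNPs become droppable, matching the 99.99% figure reported in the abstract.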


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Abstract: Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounds downstream results. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method, which uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance with 14% of the memory footprint and 5.5 times the speed.
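The core intuition behind aligning against multiple population references can be shown with a toy example: a read carrying a non-reference allele scores poorly against the single linear reference but perfectly against a population reference that includes that allele. This is an ungapped, brute-force illustration, not the actual reference flow implementation (which uses staged alignment and liftover); all sequences and names are hypothetical.

```python
# Toy illustration of the multi-reference idea: score a read against
# several population references and keep the best hit, so reads carrying
# non-reference alleles are not systematically penalised against a
# single linear reference.

def best_hit(read, references):
    """Ungapped scan: returns (reference_name, offset, mismatches)
    for the best-scoring placement across all references."""
    best = None
    for name, ref in references.items():
        for offset in range(len(ref) - len(read) + 1):
            mism = sum(a != b for a, b in zip(read, ref[offset:offset + len(read)]))
            if best is None or mism < best[2]:
                best = (name, offset, mism)
    return best

references = {                    # hypothetical population consensus sequences
    "linear_ref": "ACGTACGTAC",
    "pop_A":      "ACGTACGAAC",   # carries an alternate allele (T -> A)
}
print(best_hit("TACGAA", references))  # ('pop_A', 3, 0): exact hit on pop_A
```

Against the linear reference alone, this read's best placement has one mismatch at the variant site, which is exactly the kind of penalty that accumulates into reference bias at heterozygous positions.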


2021 ◽  
Vol 13 (1) ◽  
pp. 105-132
Author(s):  
Todd Elder ◽  
Yuqing Zhou

Using two nationally representative datasets, we find large differences between Black and White children in teacher-reported measures of noncognitive skills. We show that teacher reports understate true Black-White skill gaps because of reference bias: teachers appear to rate children relative to others in the same school, and Black students have lower-skilled classmates on average than do White students. We pursue three approaches to addressing these reference biases. Each approach nearly doubles the estimated Black-White gaps in noncognitive skills, to roughly 0.9 standard deviations in third grade. (JEL I21, I26, J13, J15, J24)
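The mechanism described above can be made concrete with a stylised numeric example (hypothetical numbers, not the article's data): if teachers rate pupils relative to their own school's mean, within-school ratings erase between-school differences, and when one group is concentrated in lower-skilled schools, the pooled rated gap understates the true skill gap.

```python
# Stylised illustration of reference bias in teacher ratings: each school
# is a (group_a_skills, group_b_skills) pair; ratings are either absolute
# skills or skills centred on the school mean (the "reference bias" case).

def pooled_gap(schools, use_relative):
    """Pooled mean difference between group A and group B scores."""
    a, b = [], []
    for group_a, group_b in schools:
        scores = group_a + group_b
        if use_relative:
            school_mean = sum(scores) / len(scores)
            scores = [s - school_mean for s in scores]   # rated vs. classmates
        a += scores[:len(group_a)]
        b += scores[len(group_a):]
    return sum(a) / len(a) - sum(b) / len(b)

# Group B is concentrated in the lower-skilled school (hypothetical data):
schools = [([60, 62, 58], [50]),    # higher-skill school, mostly group A
           ([40], [30, 32, 34])]    # lower-skill school, mostly group B

print(pooled_gap(schools, use_relative=False))  # true gap: 18.5
print(pooled_gap(schools, use_relative=True))   # rated gap: 6.75
```

Centring on the school mean shrinks the measured gap to roughly a third of its true size here, which is the same direction of distortion the authors correct for when their adjustments nearly double the estimated gaps.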

