Recalibration of mapping quality scores in Illumina short-read alignments improves SNP detection results in low-coverage sequencing data

PeerJ ◽  
2020 ◽  
Vol 8 ◽  
pp. e10501
Author(s):  
Eliot Cline ◽  
Nuttachat Wisittipanit ◽  
Tossapon Boongoen ◽  
Ekachai Chukeatirote ◽  
Darush Struss ◽  
...  

Background Low-coverage sequencing is a cost-effective way to obtain reads spanning an entire genome. However, read depth at each locus is low, making sequencing error difficult to separate from actual variation. Prior to variant calling, sequencer reads are aligned to a reference genome, with alignments stored in Sequence Alignment/Map (SAM) files. Each alignment has a mapping quality (MAPQ) score indicating the probability that a read is incorrectly aligned. This study investigated the recalibration of the probability estimates used to compute MAPQ scores to improve variant calling performance in single-sample, low-coverage settings. Materials and Methods Simulated tomato, hot pepper and rice genomes were implanted with known variants. From these, simulated paired-end reads were generated at low coverage and aligned to the original reference genomes. Features extracted from the SAM-formatted alignment files for tomato were used to train machine learning models to detect incorrectly aligned reads and output estimates of the probability of misalignment for each read in all three data sets. MAPQ scores were then re-computed from these estimates. Next, the SAM files were updated with the new MAPQ scores. Finally, variant calling was performed on the original and recalibrated alignments and the results were compared. Results Incorrectly aligned reads comprised only 0.16% of the reads in the training set. This severe class imbalance required special consideration for model training. The F1 score for detecting misaligned reads ranged from 0.76 to 0.82. The best performing model was used to compute new MAPQ scores. Single Nucleotide Polymorphism (SNP) detection was improved after mapping score recalibration. In rice, recall for called SNPs increased by 5.2%, while for tomato and pepper it increased by 3.1% and 1.5%, respectively. For all three data sets the precision of SNP calls ranged from 0.91 to 0.95 and was largely unchanged before and after mapping score recalibration. Conclusion Recalibrating MAPQ scores delivers modest improvements in single-sample variant calling results. Some variant callers operate on multiple samples simultaneously, exploiting every sample's reads to compensate for the low read depth of individual samples; this improves polymorphism detection and genotype inference. It may be that small improvements in single-sample settings translate to larger gains in a multi-sample experiment. A study to investigate this is ongoing.
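The MAPQ recalibration described above boils down to a Phred-scaled conversion of each read's estimated misalignment probability. A minimal sketch, assuming a classifier has already produced that per-read probability and using a conventional cap of 60 (both assumptions for illustration, not details taken from the paper); in practice the recomputed value would be written back into the MAPQ field of each SAM record before variant calling:

```python
import math

MAPQ_CAP = 60  # assumed ceiling, matching the typical aligner maximum


def recalibrated_mapq(p_misaligned: float) -> int:
    """Convert a model-estimated misalignment probability to a Phred-scaled MAPQ."""
    if p_misaligned <= 0.0:
        return MAPQ_CAP
    mapq = -10.0 * math.log10(p_misaligned)
    return max(0, min(MAPQ_CAP, round(mapq)))


# Example: a read the classifier considers very unlikely to be misaligned
print(recalibrated_mapq(1e-5))   # -> 50
print(recalibrated_mapq(0.25))   # -> 6
```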

2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Kelley Paskov ◽  
Jae-Yoon Jung ◽  
Brianna Chrisman ◽  
Nate T. Stockham ◽  
Peter Washington ◽  
...  

Abstract Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample. Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites. Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
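The Mendelian errors underpinning this approach are straightforward to detect in a trio: a child's diploid genotype must be composable from one maternal and one paternal allele. A minimal sketch of that consistency test (the function name and genotype encoding are illustrative, not from the paper); the paper's contribution is extrapolating from the observable fraction of such errors to genome-wide precision and recall estimates:

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """Return True if the child's diploid genotype can be formed by taking
    one allele from the mother and one from the father."""
    for m_allele, f_allele in product(mother, father):
        if sorted(child) == sorted((m_allele, f_allele)):
            return True
    return False


# A biallelic site with alleles 0 (ref) and 1 (alt):
print(is_mendelian_consistent((0, 1), (0, 0), (1, 1)))  # True
print(is_mendelian_consistent((1, 1), (0, 0), (0, 1)))  # False: Mendelian error
```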


2017 ◽  
Author(s):  
Jade C.S. Chung ◽  
Swaine L. Chen

Abstract Next-generation sequencing data is accompanied by quality scores that quantify sequencing error. Inaccuracies in these quality scores propagate through all subsequent analyses; thus base quality score recalibration is a standard step in many next-generation sequencing workflows, resulting in improved variant calls. Current base quality score recalibration algorithms rely on the assumption that sequencing errors are already known; for human resequencing data, relatively complete variant databases facilitate this. However, because existing databases are still incomplete, recalibration remains inaccurate, and most organisms do not have variant databases at all, exacerbating inaccuracy for non-human data. To overcome these logical and practical problems, we introduce Lacer, which recalibrates base quality scores without assuming knowledge of correct and incorrect bases and without requiring knowledge of common variants. Lacer is the first logically sound, fully general, and truly accurate base recalibrator. Lacer enhances variant identification accuracy for resequencing data of humans as well as other organisms (which are not accessible to current recalibrators), simultaneously improving and extending the benefits of base quality score recalibration to nearly all ongoing sequencing projects. Lacer is available at: https://github.com/swainechen/lacer.
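For context, conventional base quality score recalibration compares the reported Phred quality against the empirical error rate observed in each quality bin, which is exactly where the assumed knowledge of correct and incorrect bases enters; Lacer's contribution is avoiding that assumption. A minimal sketch of the conventional empirical-quality computation (not Lacer's algorithm; the function name and smoothing choice are illustrative):

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """observations: iterable of (reported_quality, is_error) pairs.
    Returns a map from reported quality to empirically recalibrated Phred quality."""
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [errors, total]
    for q, is_error in observations:
        counts[q][0] += int(is_error)
        counts[q][1] += 1
    recal = {}
    for q, (errors, total) in counts.items():
        p = (errors + 1) / (total + 2)  # add-one smoothing to avoid log(0)
        recal[q] = round(-10.0 * math.log10(p))
    return recal


# A reported Q30 bin that erred once in 100 observations recalibrates to about Q17
# here (Q20 exactly without the smoothing term).
print(empirical_qualities([(30, i == 0) for i in range(100)]))
```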


2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Michael D. Linderman ◽  
Davin Chia ◽  
Forrest Wallace ◽  
Frank A. Nothaft

Abstract Background XHMM is a widely used tool for copy-number variant (CNV) discovery from whole exome sequencing data but can require hours to days to run for large cohorts. A more scalable implementation would reduce the need for specialized computational resources and enable increased exploration of the configuration parameter space to obtain the best possible results. Results DECA is a horizontally scalable implementation of the XHMM algorithm using the ADAM framework and Apache Spark that incorporates novel algorithmic optimizations to eliminate unneeded computation. DECA parallelizes XHMM on both multi-core shared memory computers and large shared-nothing Spark clusters. We performed CNV discovery from the read-depth matrix in 2535 exomes in 9.3 min on a 16-core workstation (35.3× speedup vs. XHMM), 12.7 min using 10 executor cores on a Spark cluster (18.8× speedup vs. XHMM), and 9.8 min using 32 executor cores on Amazon AWS’ Elastic MapReduce. We performed CNV discovery from the original BAM files in 292 min using 640 executor cores on a Spark cluster. Conclusions We describe DECA’s performance, our algorithmic and implementation enhancements to XHMM to obtain that performance, and our lessons learned porting a complex genome analysis application to ADAM and Spark. ADAM and Apache Spark are a performant and productive platform for implementing large-scale genome analyses, but efficiently utilizing large clusters can require algorithmic optimizations and careful attention to Spark’s configuration parameters.
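DECA itself is implemented on ADAM and Apache Spark; as a rough illustration of the horizontal-scaling idea only (not DECA's code), a toy PySpark sketch that distributes a per-sample normalization of a read-depth matrix across executors, with made-up data and a made-up normalization step:

```python
# Requires pyspark (pip install pyspark) and a Java runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("depth-normalization-sketch").getOrCreate()
sc = spark.sparkContext

# Toy read-depth matrix: one (sample_id, [depth per exome target]) row per sample.
depth_rows = [("s1", [10.0, 52.0, 31.0]),
              ("s2", [12.0, 47.0, 29.0]),
              ("s3", [9.0, 60.0, 35.0])]


def mean_center(row):
    """Subtract each sample's mean target depth, a simple per-row normalization."""
    sample, depths = row
    mean = sum(depths) / len(depths)
    return sample, [d - mean for d in depths]


# Each row is processed independently, so Spark can spread the work across cores or nodes.
centered = sc.parallelize(depth_rows).map(mean_center).collect()
print(centered)
spark.stop()
```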


2021 ◽  
Vol 12 ◽  
Author(s):  
Guojun Liu ◽  
Junying Zhang

Next-generation sequencing technology offers a wealth of data for detecting copy number variations (CNVs) at high resolution. However, correctly detecting CNVs of different lengths remains challenging, and new detection tools are needed to meet this demand. In this work, we propose a new CNV detection method, called CBCNV, for detecting CNVs of different lengths from whole genome sequencing data. CBCNV uses a clustering algorithm to partition the read-depth segment profile and assigns an abnormal score to each read-depth segment. Based on the abnormal score profile, CBCNV applies Tukey's fences method to predict CNVs. The performance of the proposed method is evaluated on simulated data sets and compared with those of several existing methods. The experimental results show that CBCNV performs better than these existing methods. The proposed method is further tested and verified on real data sets, and the experimental results are consistent with the simulation results. The proposed method can therefore be expected to become a routine tool in the analysis of CNVs from tumor-normal matched samples.
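Tukey's fences is the standard interquartile-range rule: segments whose abnormal score falls outside [Q1 - k·IQR, Q3 + k·IQR] are flagged as CNV candidates. A minimal sketch, assuming the conventional k = 1.5 (CBCNV's exact threshold and scoring are described in the paper):

```python
import numpy as np


def tukey_fences(scores, k=1.5):
    """Flag outliers in an abnormal-score profile using Tukey's fences.
    Returns indices of segments whose score lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return np.where((scores < lower) | (scores > upper))[0]


# Segment 8 stands out from the bulk of the abnormal-score profile.
scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.14, 0.16, 3.50, 0.13]
print(tukey_fences(scores))  # -> [8]
```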


2021 ◽  
Author(s):  
Tim H. Heupink ◽  
Lennert Verboven ◽  
Robin M. Warren ◽  
Annelies Van Rie

Abstract Improved understanding of the genomic variants that allow Mycobacterium tuberculosis (Mtb) to acquire drug resistance or tolerance and increase its virulence is important for controlling the current tuberculosis epidemic. Current approaches to Mtb sequencing, however, cannot reveal Mtb's full genomic diversity because of strict requirements for low contamination levels, high Mtb sequence coverage, and exclusion of complex regions. We developed the XBS (compleX Bacterial Samples) bioinformatics pipeline, which implements joint calling and machine-learning-based variant filtering tools to specifically improve variant detection in the important Mtb samples that do not meet these criteria, such as those from unbiased sputum samples. Using novel simulated datasets that permit exact accuracy verification, XBS was compared to the UVP and MTBseq pipelines. Accuracy statistics showed that all three pipelines performed equally well on sequence data resembling those obtained from high-coverage, low-contamination culture isolates. In complex genomic regions, however, XBS accurately identified 9.0% more single nucleotide polymorphisms and 8.1% more single nucleotide insertions and deletions than the WHO-endorsed UVP. XBS also had superior accuracy on sequence data resembling those obtained directly from sputum samples, where depth of coverage is typically very low and contamination levels are high. XBS was the only pipeline unaffected by low depth of coverage (5-10×), type of contamination, and excessive contamination levels (>50%). Simulation results were confirmed using WGS data from clinical samples: XBS achieved higher sensitivity (98.8%) when analysing culture isolates and identified 13.9% more variable sites in WGS data from sputum samples than MTBseq, with no evidence of false positive variants once ribosomal RNA regions were excluded. The XBS pipeline facilitates sequencing of less-than-perfect Mtb samples. These advances will benefit future clinical applications of Mtb sequencing, especially whole genome sequencing directly from clinical specimens, thereby avoiding in vitro biases and making many more samples available for drug resistance and other genomic analyses. The additional genetic resolution and increased sample success rate will improve genome-wide association studies and sequence-based transmission studies.
Impact statement Mycobacterium tuberculosis (Mtb) DNA is usually extracted from culture isolates to obtain large quantities of uncontaminated DNA, but this process can change the make-up of the bacterial population and is time-consuming. Furthermore, current analytic approaches exclude complex genomic regions, where DNA sequences are repeated, to avoid inferring false positive genetic variants; this may result in the loss of important genetic information. We designed the compleX Bacterial Sample (XBS) variant caller to overcome these limitations. XBS employs joint variant calling and machine-learning-based variant filtering to ensure that high-quality variants can be inferred from low-coverage and highly contaminated genomic sequence data obtained directly from sputum samples. Simulation and clinical data analyses showed that XBS performs better than other pipelines, as it identifies more genetic variants and can handle complex (low-depth, highly contaminated) Mtb samples. The XBS pipeline was designed to analyse Mtb samples but can easily be adapted to other complex bacterial samples.
Data summary Simulated sequencing data have been deposited in SRA BioProject PRJNA706121. All detailed findings are available in the Supplementary Material. Scripts for running the XBS variant calling core are available at https://github.com/TimHHH/XBS. The authors confirm that all supporting data, code and protocols have been provided within the article or through supplementary data files.
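As a generic illustration of machine-learning-based variant filtering (not XBS's actual model, features, or training data), a classifier can be trained on per-variant annotations with known labels and used to score new candidate calls; the annotation names below are hypothetical:

```python
# Generic sketch of ML-based variant filtering; requires scikit-learn (pip install scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-variant annotations: [depth, quality-by-depth, strand-bias, mapping-quality]
X_train = np.array([
    [40, 25.0, 1.2, 60], [35, 20.1, 2.0, 58], [8, 3.5, 14.0, 20],
    [50, 30.2, 0.8, 60], [6, 2.1, 20.5, 17], [10, 4.0, 12.3, 25],
])
y_train = np.array([1, 1, 0, 1, 0, 0])  # 1 = true variant, 0 = artefact (labels known in training data)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Score a new candidate variant; keep it only if the predicted probability is high enough.
candidate = np.array([[12, 5.2, 10.0, 30]])
print(clf.predict_proba(candidate)[0, 1])  # probability the candidate is a true variant
```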


2016 ◽  
Author(s):  
Haeyoung Jeong ◽  
Jae-Goo Pan ◽  
Seung-Hwan Park

ABSTRACT The nonhybrid hierarchical assembly of PacBio long reads is becoming the preferred method for obtaining genomes of microbial isolates. Meanwhile, among the massive numbers of Illumina sequencing reads produced, failed microbial genome assemblies (high contig number, large total contig size, and/or the presence of low-depth contigs) are rarely re-evaluated. We generated Illumina-type test datasets with various levels of sequencing error, pretreatment (trimming and error correction), repetitive sequences, contamination, and ploidy from both simulated and real sequencing data and applied k-mer abundance analysis to quickly detect possible diagnostic signatures of poor assemblies. Contamination was the only factor leading to poor assemblies for the test dataset derived from haploid microbial genomes, producing an extraordinary peak within the low-frequency k-mer range. When thirteen Illumina read sets from microbes belonging to the genera Bacillus or Paenibacillus, sequenced in a single multiplexed run, were subjected to k-mer abundance analysis, all three samples that led to poor assemblies showed peculiar patterns of contamination. Read-depth distributions along contig length indicated that all problematic assemblies suffered from too many contigs with low average read coverage, with 1% to 15% of total reads mapped to low-coverage contigs. We found that subsampling, or filtering out reads carrying rare k-mers, could efficiently remove low-level contaminants and greatly improve the de novo assemblies. Analysis of 16S rRNA genes recruited from reads or contigs, and the application of read classification tools originally designed for metagenome analyses, can help identify the source of contamination. The unexpected presence of proteobacterial reads across multiple samples, which had no relevance to our lab environment, implies that this prevalent contamination occurred after the DNA preparation step, probably at the facility where the sequencing service was provided.
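A k-mer abundance analysis of this kind reduces to counting canonical k-mers across the read set and inspecting the multiplicity histogram; a secondary peak in the low-frequency range, beyond the sequencing-error tail, is the contamination signature described above. A minimal sketch (k = 21 and the toy reads are illustrative only):

```python
from collections import Counter


def kmer_abundance_histogram(reads, k=21):
    """Count canonical k-mers across reads and return a histogram:
    multiplicity -> number of distinct k-mers observed that many times."""
    comp = str.maketrans("ACGT", "TGCA")
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            rc = kmer.translate(comp)[::-1]  # reverse complement
            counts[min(kmer, rc)] += 1       # canonical form collapses the two strands
    return Counter(counts.values())


reads = ["ACGTACGTACGTACGTACGTACGT",
         "ACGTACGTACGTACGTACGTACGT",
         "TTTTGGGGCCCCAAAATTTTGGGG"]
print(kmer_abundance_histogram(reads, k=21))
```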


DNA Research ◽  
2017 ◽  
Vol 24 (4) ◽  
pp. 397-405 ◽  
Author(s):  
Masaaki Kobayashi ◽  
Hajime Ohyanagi ◽  
Hideki Takanashi ◽  
Satomi Asano ◽  
Toru Kudo ◽  
...  

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Irena Fischer-Hwang ◽  
Idoia Ochoa ◽  
Tsachy Weissman ◽  
Mikel Hernaez

Abstract Noise in genomic sequencing data is known to affect various stages of genomic data analysis pipelines. Variant identification is an important step in many of these pipelines and is increasingly used in clinical settings to guide medical practice. We propose a denoising method, dubbed SAMDUDE, which operates on aligned genomic data to improve variant calling performance. Denoising human data with SAMDUDE improved variant identification in both individual-chromosome and whole genome sequencing (WGS) data sets. In the WGS data set, denoising led to the identification of almost 2,000 additional true variants and the elimination of over 1,500 erroneously identified variants. In contrast, we found that denoising with other state-of-the-art denoisers significantly worsens variant calling performance. SAMDUDE is written in Python and is freely available at https://github.com/ihwang/SAMDUDE.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Chuanyi Zhang ◽  
Mohammed El-Kebir ◽  
Idoia Ochoa

Abstract Intra-tumor heterogeneity renders the identification of somatic single-nucleotide variants (SNVs) a challenging problem. In particular, low-frequency SNVs are hard to distinguish from sequencing artifacts. While the increasing availability of multi-sample tumor DNA sequencing data holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample SNV callers that utilize these data. Here we report Moss, a method to identify low-frequency SNVs that recur in multiple sequencing samples from the same tumor. Moss provides any existing single-sample SNV caller with the ability to support multiple samples with little additional time overhead. We demonstrate that Moss improves recall while maintaining high precision on a simulated dataset. On multi-sample hepatocellular carcinoma, acute myeloid leukemia and colorectal cancer datasets, Moss identifies new low-frequency variants that meet manual review criteria and are consistent with the tumor’s mutational signature profile. In addition, Moss detects the presence of variants in more samples of the same tumor than reported by the single-sample caller. Moss’ improved sensitivity in SNV calling will enable more detailed downstream analyses in cancer genomics.
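As a generic illustration of why recurrence across samples boosts sensitivity (this is not Moss's statistical model), per-sample evidence for the same candidate SNV can be combined, for example with Fisher's method: individually weak signals become jointly significant.

```python
# Generic illustration only; requires SciPy (pip install scipy).
from scipy.stats import combine_pvalues

# Hypothetical per-sample p-values from a single-sample caller at one locus;
# each sample alone is unconvincing, but the signal recurs in every sample.
per_sample_p = [0.08, 0.11, 0.06, 0.09]

stat, combined_p = combine_pvalues(per_sample_p, method="fisher")
print(f"combined p-value across samples: {combined_p:.4g}")  # ~0.01, far below any single sample
```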

