High performance of a GPU-accelerated variant calling tool in genome data analysis

Rapid advances in next-generation sequencing (NGS) have facilitated ultralarge population and cohort studies that use whole-genome sequencing (WGS) to identify DNA variants that may impact gene function. Massive sequencing data require highly efficient bioinformatics tools to complete read alignment and variant calling, the fundamental analysis steps. Multiple software and hardware acceleration strategies have been developed to boost analysis speed. This study comprehensively evaluated the germline variant calling of a GPU-based acceleration tool, BaseNumber, using WGS datasets from several sources, including gold-standard samples from the Genome in a Bottle (GIAB) project and the Golden Standard of China Genome (GSCG) project, resequenced GSCG samples, and 100 in-house samples from the China Deafness Genetics Consortium (CDGC) project. Sequencing data were analyzed on the GPU server using BaseNumber, and its variant calling outputs were compared to the reference VCF or to the results generated by the Burrows-Wheeler Aligner (BWA) + Genome Analysis Toolkit (GATK) pipeline on a generic CPU server. BaseNumber demonstrated high precision (99.32%) and recall (99.86%) in variant calls compared to the standard reference. The variant calling outputs of the BaseNumber and GATK pipelines were very similar, with a mean F1 of 99.69%. Additionally, BaseNumber took only 23 minutes on average to analyze a 48X WGS sample, 215.33 times faster than the GATK workflow. The GPU-based BaseNumber provides highly accurate and ultrafast variant calling, significantly improving WGS analysis efficiency and facilitating time-sensitive tests, such as clinical WGS genetic diagnosis, and it sheds light on the GPU-based acceleration of other omics data analyses.
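
The precision, recall and F1 figures quoted in this abstract follow the standard definitions for comparing a call set against a reference. A minimal Python sketch of that comparison is given here for orientation; it is not the authors' evaluation code. The `Variant` tuple and the `compare_calls` function are hypothetical names, and the exact-match comparison is a simplifying assumption (it ignores differences in how the same variant can be represented).

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def compare_calls(truth: Set[Variant], query: Set[Variant]) -> Dict[str, float]:
    """Compare a query call set to a truth set by exact key match."""
    tp = len(truth & query)   # called and present in the truth set
    fp = len(query - truth)   # called but absent from the truth set
    fn = len(truth - query)   # present in the truth set but not called
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
    query = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")}
    print(compare_calls(truth, query))
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.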

Accurate fetal variant calling in the presence of maternal cell contamination

10.1101/552414 ◽

2019 ◽

Cited By ~ 1

Author(s):

Elena Nabieva ◽

Satyarth Mishra Sharma ◽

Yermek Kapushev ◽

Sofya K. Garushyants ◽

Anna V. Fedotova ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Chorionic Villus ◽

Genetic Diagnosis ◽

Variant Calling ◽

Data Availability ◽

Training Data ◽

Sequencing Data ◽

Maternal Cell ◽

Fetal DNA

Abstract: High-throughput sequencing of fetal DNA is a promising and increasingly common method for the discovery of all (or all coding) genetic variants in the fetus, either as part of prenatal screening or diagnosis, or for genetic diagnosis of spontaneous abortions. In many cases, the fetal DNA (from chorionic villi, amniotic fluid, or abortive tissue) can be contaminated with maternal cells, resulting in a mixture of fetal and maternal DNA. This maternal cell contamination (MCC) undermines the assumption, made by traditional variant callers, that each allele in a heterozygous site is covered, on average, by 50% of the reads, and therefore can lead to erroneous genotype calls. We present a panel of methods for reducing the genotyping error in the presence of MCC. All methods start with the output of GATK HaplotypeCaller on the sequencing data for the (contaminated) fetal sample and both of its parents, and additionally rely on information about the MCC fraction (which itself is readily estimated from the high-throughput sequencing data). The first of these methods uses a Bayesian probabilistic model to correct the fetal genotype calls produced by MCC-unaware HaplotypeCaller. The other two methods “learn” the genotype-correction model from examples. We use simulated contaminated fetal data to train and test the models. Using the test sets, we show that all three methods lead to substantially improved accuracy when compared with the original MCC-unaware HaplotypeCaller calls. We then apply the best-performing method to three chorionic villus samples from spontaneously terminated pregnancies. Code and training data availability: https://github.com/bazykinlab/ML-maternal-cell-contamination
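
To illustrate why contamination breaks the 50% assumption, the following Python sketch works through a toy Bayesian correction at a single biallelic site. It is an illustration under stated assumptions, not the model from the paper or its repository: the function names, the binomial read model, the Mendelian prior from known parental genotypes, and the fixed `error` rate are all hypothetical choices made here.

```python
from math import comb
from typing import List


def expected_alt_fraction(fetal_alt: int, maternal_alt: int, mcc: float) -> float:
    """Expected fraction of alt reads when a share `mcc` of the DNA is maternal.

    Genotypes are given as alt-allele counts (0, 1 or 2).
    """
    return ((1.0 - mcc) * fetal_alt + mcc * maternal_alt) / 2.0


def mendelian_prior(maternal_alt: int, paternal_alt: int) -> List[float]:
    """Prior over the fetal alt-allele count, given both parental genotypes."""
    pm = maternal_alt / 2.0  # chance the mother transmits the alt allele
    pp = paternal_alt / 2.0  # chance the father transmits the alt allele
    return [(1 - pm) * (1 - pp), pm * (1 - pp) + (1 - pm) * pp, pm * pp]


def fetal_posterior(alt_reads: int, total_reads: int, maternal_alt: int,
                    paternal_alt: int, mcc: float, error: float = 0.01) -> List[float]:
    """Posterior over the fetal genotype (0, 1, 2 alt alleles) at one site."""
    prior = mendelian_prior(maternal_alt, paternal_alt)
    unnormalized = []
    for fetal_alt in (0, 1, 2):
        p = expected_alt_fraction(fetal_alt, maternal_alt, mcc)
        p = p * (1 - error) + (1 - p) * error  # assumed symmetric read error
        likelihood = (comb(total_reads, alt_reads)
                      * p ** alt_reads * (1 - p) ** (total_reads - alt_reads))
        unnormalized.append(prior[fetal_alt] * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    # Mother heterozygous, father homozygous reference, 30% contamination,
    # 5 alt reads out of 30 in the contaminated fetal sample.
    print(fetal_posterior(5, 30, maternal_alt=1, paternal_alt=0, mcc=0.3))
```

In this toy case the alt reads are better explained by maternal DNA than by a heterozygous fetus, so the posterior favors the homozygous-reference genotype even though alt reads are present.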

Workstation benchmark of Spark Capable Genome Analysis ToolKit 4 Variant Calling

10.1101/2020.05.17.101105 ◽

2020 ◽

Author(s):

Marcus H. Hansen ◽

Anita T. Simonsen ◽

Hans B. Ommen ◽

Charlotte G. Nyvold

Keyword(s):

DNA Sequencing ◽

Genome Analysis ◽

High Speed ◽

High Performance ◽

Variant Calling ◽

Amplicon Sequencing ◽

Targeted Sequencing ◽

Sequencing Analysis ◽

Genome Analysis Toolkit ◽

Order Of Magnitude

Abstract. Background: Rapid and practical DNA-sequencing processing has become essential for modern biomedical laboratories, especially in the fields of cancer, pathology and genetics. While sequencing turnaround time has been, and still is, a bottleneck in research and diagnostics, the field of bioinformatics is moving at a rapid pace, both in terms of hardware and software development. Here, we benchmarked the local performance of three of the most important Spark-enabled Genome Analysis Toolkit 4 (GATK4) tools in a targeted sequencing workflow: duplicate marking, base quality score recalibration (BQSR) and variant calling on targeted DNA sequencing, using a modest hyperthreading 12-core single CPU and a high-speed PCI Express solid-state drive. Results: Compared to the previous GATK version, the performance of Spark-enabled BQSR and HaplotypeCaller is shifted towards a more efficient usage of the available CPU cores and outperforms the earlier GATK3.8 version with an order of magnitude reduction in processing time to analysis-ready variants, whereas MarkDuplicatesSpark was found to be thrice as fast. Furthermore, HaplotypeCallerSpark and BQSRPipelineSpark were significantly faster than the equivalent GATK4 standard tools, with a combined ∼86% reduction in execution time, reaching a median rate of ten million processed bases per second, and duplicate marking time was reduced by ∼42%. The called variants were found to be in close agreement between the Spark and non-Spark versions, with an overall concordance of 98%. In this setup, the tools were also highly efficient when compared with execution on a small 72 virtual CPU/18-node Google Cloud cluster. Conclusion: GATK4 offers practical parallelization possibilities for DNA sequence processing, and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark-enabled GATK variant calling is several times faster than previous GATK3.8 multithreading with the same multi-core, single-CPU configuration. The improved opportunities for parallel computations not only hold implications for high-performance clusters, but also for modest laboratory or research workstations for targeted sequencing analysis, such as exome, panel or amplicon sequencing.
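
The abstract mixes two ways of reporting performance: percentage reductions in execution time (∼86%, ∼42%) and speed-up factors ("thrice as fast", "an order of magnitude"). The short Python sketch below converts between the two so the figures can be read on one scale; the function names are hypothetical and the conversion is plain arithmetic, not a result taken from the paper.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Speed-up factor implied by a percentage reduction in run time."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in the range [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def reduction_from_speedup(speedup: float) -> float:
    """Percentage reduction in run time implied by a speed-up factor."""
    if speedup < 1:
        raise ValueError("speed-up factor must be at least 1")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    for pct in (86, 42):
        print(f"{pct}% less time is about {speedup_from_reduction(pct):.2f}x faster")
    print(f"3x faster is about {reduction_from_speedup(3):.1f}% less time")
```

Read this way, an 86% reduction corresponds to roughly a sevenfold speed-up, and a 42% reduction to a speed-up of about 1.7.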

Numt identification and removal with RtN!

Bioinformatics ◽

10.1093/bioinformatics/btaa642 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5115-5116 ◽

Cited By ~ 2

Author(s):

August E Woerner ◽

Jennifer Churchill Cihlar ◽

Utpal Smart ◽

Bruce Budowle

Keyword(s):

Mitochondrial Genome ◽

Massively Parallel Sequencing ◽

Sequence Similarity ◽

Variant Calling ◽

Supplementary Information ◽

Mitochondrial Genomes ◽

Sequencing Data ◽

Read Mapping ◽

Genome Data ◽

Mitochondrial Sequences

Abstract. Motivation: Assays in mitochondrial genomics rely on accurate read mapping and variant calling. However, there are known and unknown nuclear paralogs that have fundamentally different genetic properties from those of the mitochondrial genome. Such paralogs complicate the interpretation of mitochondrial genome data and confound variant calling. Results: Remove the Numts! (RtN!) was developed to categorize reads from massively parallel sequencing data not based on the expected properties and sequence identities of paralogous nuclear-encoded mitochondrial sequences, but instead using sequence similarity to a large database of publicly available mitochondrial genomes. RtN! removes low-level sequencing noise and mitochondrial paralogs without impacting variant calling, whereas competing methods were shown to remove true variants from mitochondrial mixtures. Availability and implementation: https://github.com/Ahhgust/RtN Supplementary information: Supplementary data are available at Bioinformatics online.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v2 ◽

2014 ◽

Author(s):

Charles D Warden ◽

Aaron W Adamson ◽

Susan L Neuhausen ◽

Xiwei Wu

Keyword(s):

Gene List ◽

Variant Calling ◽

Detailed Comparison ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

High Quality ◽

Genome Analysis Toolkit ◽

High Concordance ◽

The Impact ◽

Processing Steps

The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an adequate comparison against VarScan may not yet have been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v3 ◽

2014 ◽

Author(s):

Charles D Warden ◽

Aaron W Adamson ◽

Susan L Neuhausen ◽

Xiwei Wu

Keyword(s):

Gene List ◽

Variant Calling ◽

Detailed Comparison ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

High Quality ◽

Genome Analysis Toolkit ◽

Comprehensive Comparison ◽

The Impact ◽

Processing Steps

The Genome Analysis Toolkit (GATK) is commonly used for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but an equally comprehensive comparison for VarScan has not yet been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

MitoFlex: an efficient, high-performance toolkit for animal mitogenome assembly, annotation, and visualization

Bioinformatics ◽

10.1093/bioinformatics/btab111 ◽

2021 ◽

Author(s):

Jun-Yu Li ◽

Wei-Xuan Li ◽

An-Tai Wang ◽

Zhang Yu

Keyword(s):

Mitochondrial Genome ◽

High Performance ◽

High Throughput Sequencing ◽

De Novo ◽

Supplementary Information ◽

Sequencing Data ◽

Protein Coding ◽

High Throughput Sequencing Data ◽

Genome Analysis Toolkit ◽

Overall Performance

Abstract. Summary: MitoFlex is a Linux-based mitochondrial genome analysis toolkit that provides a complete workflow of raw data filtering, de novo assembly, mitochondrial genome identification and annotation for animal high-throughput sequencing data. The overall performance of MitoFlex was compared with that of its analogue MitoZ in terms of protein-coding gene recovery, memory consumption and processing speed. Availability: MitoFlex is available at https://github.com/Prunoideae/MitoFlex under the GPLv3 license. Supplementary information: Supplementary data are available at Bioinformatics online.

Detailed comparison of two popular variant calling packages for exome and targeted exon studies

10.7287/peerj.preprints.403v1 ◽

2014 ◽

Author(s):

Charles D Warden ◽

Aaron W Adamson ◽

Susan L Neuhausen ◽

Xiwei Wu

Keyword(s):

Gene List ◽

Variant Calling ◽

Detailed Comparison ◽

Nucleotide Polymorphisms ◽

Sequencing Data ◽

High Quality ◽

Genome Analysis Toolkit ◽

High Concordance ◽

The Impact ◽

Processing Steps

The Genome Analysis Toolkit (GATK) is often considered to be the “gold standard” for variant calling of single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from short-read sequencing data aligned against a reference genome. There have been a number of variant calling comparisons against GATK, but we felt that an adequate comparison against VarScan may not yet have been performed. More specifically, we compared four lists of variants called by GATK (using the UnifiedGenotyper and the HaplotypeCaller algorithms, with and without filtering low quality variants) and three lists of variants called using VarScan (with varying sets of parameters). Variant calling was performed on three datasets (1 targeted exon dataset and 2 exome datasets), each with approximately a dozen subjects. We found that running VarScan with a conservative set of parameters (referred to as “VarScan-Cons”) resulted in a high quality gene list, with high concordance (>97%) when compared to high quality variants called by the GATK UnifiedGenotyper and HaplotypeCaller. These conservative parameters result in decreased sensitivity, but the VarScan-Cons variant list could still recover 84-88% of the high-quality GATK SNPs in the exome datasets. We also assessed the impact of pre-processing (e.g., indel realignment and base quality score recalibration using GATK). In most cases, these pre-processing steps had only a modest impact on the variant calls, but the importance of the pre-processing steps varied between datasets and variant callers. More broadly, we believe the metrics used for comparison in this study can be useful in assessing the quality of variant calls in the context of a specific experimental design. As an example, a limited number of variant calling comparisons are also performed on two additional variant callers.

PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead

Genes ◽

10.3390/genes10110886 ◽

2019 ◽

Vol 10 (11) ◽

pp. 886 ◽

Cited By ~ 1

Author(s):

Lingqi Zhang ◽

Cheng Liu ◽

Shoubin Dong

Keyword(s):

Genome Analysis ◽

High Speed ◽

High Performance ◽

Genome Alignment ◽

Single Node ◽

Genome Data ◽

DNA Sequence Alignment ◽

Alignment Tool ◽

Genome Analysis Toolkit ◽

Node Solution

(1) Background: DNA sequence alignment is an essential step in genome analysis. BWA-MEM has been a prevalent single-node tool for genome alignment because of its high speed and accuracy. However, genome data are being generated at an exponentially growing rate, and handling such large volumes requires a multi-node solution, which remains a challenge. Spark is a ubiquitous big data platform that has been exploited to assist genome alignment in handling this challenge. Nonetheless, existing works that use Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that accelerates BWA-MEM with lower overhead by using the pipe operation in Spark. We additionally propose a pipeline structure and in-memory computation to further accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead.
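
The pipe operation mentioned in the abstract streams each data partition through an external program via standard input and output, so an unmodified command-line tool can be reused. The Python sketch below imitates that pattern with the standard library only. It is not PipeMEM and does not use Spark: `pipe_partition`, `pipe_all` and `STAND_IN_COMMAND` are hypothetical names, and a trivial uppercasing script stands in for the aligner.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

# Stand-in for an external tool: reads lines on stdin, writes them uppercased.
STAND_IN_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]


def pipe_partition(lines: Sequence[str], command: Sequence[str]) -> List[str]:
    """Feed one partition to an external process and collect its output lines."""
    result = subprocess.run(
        list(command),
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external tool exits with an error
    )
    return result.stdout.splitlines()


def pipe_all(partitions: Sequence[Sequence[str]], command: Sequence[str],
             workers: int = 2) -> List[str]:
    """Pipe every partition through the command, keeping partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outputs = pool.map(lambda part: pipe_partition(part, command), partitions)
        return [line for out in outputs for line in out]


if __name__ == "__main__":
    reads = [["acgt", "ttga"], ["ggcc"]]
    print(pipe_all(reads, STAND_IN_COMMAND))
```

In Spark itself the corresponding call is the RDD `pipe` operation; the threads here only approximate the idea of partitions being processed in parallel.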

QUARTIC: QUick pArallel algoRithms for high-Throughput sequencIng data proCessing

F1000Research ◽

10.12688/f1000research.22954.1 ◽

2020 ◽

Vol 9 ◽

pp. 240

Author(s):

Frédéric Jarlier ◽

Nicolas Joly ◽

Nicolas Fedy ◽

Thomas Magalhaes ◽

Leonor Sirotti ◽

...

Keyword(s):

High Throughput ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

High Throughput Sequencing ◽

Genome Structure ◽

Sequencing Data ◽

Genome Data ◽

High Throughput Sequencing Data ◽

Time To Delivery

Life science has entered the so-called ’big data era’, in which biologists, clinicians and bioinformaticians are overwhelmed with an unprecedented amount of data. High-throughput sequencing has revolutionized genomics and offers new insights to decipher the genome structure. However, using these data in daily clinical practice for care and diagnosis is challenging as the datasets keep growing. Therefore, we implemented software using the Message Passing Interface so that the alignment and sorting of sequencing reads can easily scale on high-performance computing architectures. Our implementation makes it possible to reduce the time to delivery to a few minutes, even on large whole-genome data, using several hundred cores.
