Workstation benchmark of Spark Capable Genome Analysis ToolKit 4 Variant Calling

Background: Rapid and practical DNA-sequencing processing has become essential for modern biomedical laboratories, especially in the fields of cancer, pathology and genetics. While sequencing turnover time has been, and still is, a bottleneck in research and diagnostics, the field of bioinformatics is moving at a rapid pace in both hardware and software development. Here, we benchmarked the local performance of three of the most important Spark-enabled Genome Analysis Toolkit 4 (GATK4) tools in a targeted sequencing workflow: duplicate marking, base quality score recalibration (BQSR) and variant calling on targeted DNA sequencing, using a modest hyperthreading 12-core single CPU and a high-speed PCI Express solid-state drive. Results: Compared to the previous GATK version, the performance of Spark-enabled BQSR and HaplotypeCaller is shifted towards more efficient usage of the available CPU cores and outperforms the earlier GATK3.8 version with an order-of-magnitude reduction in processing time to analysis-ready variants, whereas MarkDuplicatesSpark was found to be three times as fast. Furthermore, HaplotypeCallerSpark and BQSRPipelineSpark were significantly faster than the equivalent GATK4 standard tools, with a combined ∼86% reduction in execution time, reaching a median rate of ten million processed bases per second, and duplicate-marking time was reduced by ∼42%. The called variants were found to be in close agreement between the Spark and non-Spark versions, with an overall concordance of 98%. In this setup, the tools were also highly efficient when compared with execution on a small 72 virtual CPU/18-node Google Cloud cluster. Conclusion: GATK4 offers practical parallelization possibilities for DNA sequence processing, and the Spark-enabled tools optimize performance and utilization of local CPUs. Spark-utilizing GATK variant calling is several times faster than previous GATK3.8 multithreading with the same multi-core, single-CPU configuration. The improved opportunities for parallel computation hold implications not only for high-performance clusters, but also for modest laboratory or research workstations for targeted sequencing analysis, such as exome, panel or amplicon sequencing.
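
The percentage reductions quoted above can be read as speedup factors. The following is a minimal sketch of that arithmetic, assuming each reduction is measured relative to the slower tool's execution time; the function name `speedup_from_reduction` is illustrative and does not come from the paper.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional reduction in run time into a speedup factor.

    A reduction r means new_time = (1 - r) * old_time,
    so speedup = old_time / new_time = 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


# Figures taken from the abstract: ~86% (BQSR + HaplotypeCaller combined)
# and ~42% (duplicate marking).
combined_speedup = speedup_from_reduction(0.86)
duplicate_marking_speedup = speedup_from_reduction(0.42)

print(f"combined: {combined_speedup:.2f}x")
print(f"duplicate marking: {duplicate_marking_speedup:.2f}x")
```

Under this reading, an 86% reduction corresponds to roughly a sevenfold speedup, and a 42% reduction to a little under twofold.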

PipeMEM: A Framework to Speed Up BWA-MEM in Spark with Low Overhead

Genes ◽

10.3390/genes10110886 ◽

2019 ◽

Vol 10 (11) ◽

pp. 886 ◽

Cited By ~ 1

Author(s):

Lingqi Zhang ◽

Cheng Liu ◽

Shoubin Dong

Keyword(s):

Genome Analysis ◽

High Speed ◽

High Performance ◽

Genome Alignment ◽

Single Node ◽

Genome Data ◽

Dna Sequence Alignment ◽

Alignment Tool ◽

Genome Analysis Toolkit ◽

Node Solution

(1) Background: The DNA sequence alignment process is an essential step in genome analysis. BWA-MEM has been a prevalent single-node tool for genome alignment because of its high speed and accuracy. Genome data are being generated at an exponentially growing rate, and handling such volumes with a multi-node solution remains a challenge. Spark is a ubiquitous big data platform that has been exploited to help genome alignment meet this challenge. Nonetheless, existing works that use Spark to optimize BWA-MEM suffer from high overhead. (2) Methods: In this paper, we present PipeMEM, a framework that accelerates BWA-MEM with lower overhead by means of the pipe operation in Spark. We additionally propose a pipeline structure and in-memory computation to accelerate PipeMEM. (3) Results: Our experiments showed that, on paired-end alignment tasks, our framework had low overhead. In a multi-node environment, our framework was, on average, 2.27× faster than BWASpark (an alignment tool in the Genome Analysis Toolkit (GATK)) and 2.33× faster than SparkBWA. (4) Conclusions: PipeMEM can accelerate BWA-MEM in the Spark environment with high performance and low overhead.
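
The pipe idea described in the abstract, streaming each partition of records through an unmodified external program over stdin and stdout, can be sketched without Spark. This is a hedged, standard-library illustration only: it is not PipeMEM's implementation, the helper names are hypothetical, and a child Python process that upper-cases its input stands in for the aligner.

```python
import subprocess
import sys

# Stand-in for an external tool such as an aligner (assumption: any program
# that reads records on stdin and writes results on stdout would fit here).
STAND_IN_TOOL = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]


def pipe_partition(records, command):
    """Send one partition's records to `command` via stdin; return stdout lines."""
    payload = "".join(record + "\n" for record in records)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


def pipe_all(partitions, command):
    """Apply the external command to every partition independently."""
    return [pipe_partition(partition, command) for partition in partitions]


partitions = [["acgt", "ttga"], ["ggcc"]]
piped = pipe_all(partitions, STAND_IN_TOOL)
print(piped)
```

Because the external program is invoked as-is, the wrapper adds little beyond process start-up and data transfer, which is the kind of overhead a pipe-based design aims to keep low.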

Big Data Analytics Framework for Real-Time Genome Analysis: A Comprehensive Approach

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2019.8302 ◽

2019 ◽

Vol 16 (8) ◽

pp. 3419-3427

Author(s):

Shishir K. Shandilya ◽

S. Sountharrajan ◽

Smita Shandilya ◽

E. Suganya

Keyword(s):

Big Data ◽

Real Time ◽

Genome Analysis ◽

High Speed ◽

High Performance ◽

Big Data Analytics ◽

Good Precision ◽

Entire Genome ◽

Big Data Technologies ◽

Sequencing Platforms

Big data technologies have become well accepted in recent years in biomedical and genome informatics. They are capable of processing gigantic and heterogeneous genome information with good precision and recall. With rapid advances in computation and storage technologies, the cost of acquiring and processing genomic data has decreased significantly. Upcoming sequencing platforms will produce vast amounts of data, which will require high-performance systems for on-demand analysis with time-bound efficiency. Recent bioinformatics tools are capable of utilizing the novel features of Hadoop in a more flexible way. In particular, big data technologies such as MapReduce and Hive can provide a high-speed computational environment for the analysis of petabyte-scale datasets. This has drawn bio-scientists to use big data applications to automate the entire genome analysis. The proposed framework is built on MapReduce and Java on an extended Hadoop platform to achieve parallelism in big data analysis. It will assist the bioinformatics community by providing a comprehensive solution for descriptive, comparative, exploratory, inferential, predictive and causal analysis of genome data. The proposed framework is user-friendly, fully customizable, scalable and suited to comprehensive real-time genome analysis, from data acquisition through predictive sequence analysis.
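
The MapReduce pattern the abstract builds on can be illustrated with a small in-process sketch. The abstract does not describe the framework's actual jobs, so the k-mer counting task and the function names below are assumptions chosen only to show the map and reduce phases.

```python
from collections import Counter
from functools import reduce


def map_kmers(read, k):
    """Map phase: emit the k-mer counts found in a single read."""
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))


def reduce_counts(left, right):
    """Reduce phase: merge two partial count tables."""
    return left + right


def kmer_counts(reads, k):
    """Run the map phase over every read, then fold the partial results."""
    partials = (map_kmers(read, k) for read in reads)
    return reduce(reduce_counts, partials, Counter())


counts = kmer_counts(["ACGTAC", "GTACG"], 2)
print(counts.most_common())
```

Each read is mapped independently, so the map phase can be spread across workers, and the reduce step is associative, so partial tables can be merged in any order.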

Sentieon DNA pipeline for variant detection - Software-only solution, over 20× faster than GATK 3.3 with identical results

10.7287/peerj.preprints.1672v1 ◽

2016 ◽

Cited By ~ 2

Author(s):

Jessica A. Weber ◽

Rafael Aldana ◽

Brendan D. Gallagher ◽

Jeremy S. Edwards

Keyword(s):

Dna Sequencing ◽

Genome Analysis ◽

Processing Speed ◽

Best Practice ◽

Secondary Analysis ◽

Benchmark Analysis ◽

Genome Analysis Toolkit ◽

Variant Detection ◽

Detection Software

Sentieon DNA Software is a suite of tools for running DNA sequencing secondary analysis pipelines. The Sentieon DNA Software produces results identical to the Genome Analysis Toolkit (GATK) Best Practice Workflow using HaplotypeCaller, with a more than 20× increase in processing speed on the same hardware. This paper presents a benchmark analysis comparing both the speed and the output concordance of GATK and Sentieon DNA software on publicly available datasets from the 1000 Genomes database.
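
Output concordance between two variant callers can be computed in several ways, and the abstract does not state which definition was used. The sketch below assumes one simple definition, shared calls divided by the union of calls, with each variant keyed by chromosome, position, reference allele and alternate allele; the toy call sets are invented for illustration.

```python
def concordance(calls_a, calls_b):
    """Fraction of distinct variants that both call sets report.

    Each variant is a (chrom, pos, ref, alt) tuple. Two empty call sets
    are treated as fully concordant.
    """
    set_a, set_b = set(calls_a), set(calls_b)
    union = set_a | set_b
    if not union:
        return 1.0
    return len(set_a & set_b) / len(union)


# Toy call sets (hypothetical data, not taken from any benchmark).
caller_one = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 40, "G", "A")]
caller_two = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "T", "C")]

score = concordance(caller_one, caller_two)
print(f"concordance: {score:.2f}")
```

Here two of the four distinct variants are shared, giving a concordance of 0.50; identical outputs would score 1.0.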

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data

Genome Research ◽

10.1101/gr.107524.110 ◽

2010 ◽

Vol 20 (9) ◽

pp. 1297-1303 ◽

Cited By ~ 11694

Author(s):

A. McKenna ◽

M. Hanna ◽

E. Banks ◽

A. Sivachenko ◽

K. Cibulskis ◽

...

Keyword(s):

Dna Sequencing ◽

Genome Analysis ◽

Next Generation ◽

Sequencing Data ◽

Mapreduce Framework ◽

Next Generation Dna Sequencing ◽

Genome Analysis Toolkit

Quasi-interpenetrating network formed by polyacrylamide and poly(N,N-dimethylacrylamide) used in high-performance DNA sequencing analysis by capillary electrophoresis

Electrophoresis ◽

10.1002/elps.200406162 ◽

2005 ◽

Vol 26 (1) ◽

pp. 126-136 ◽

Cited By ~ 17

Author(s):

Yanmei Wang ◽

Dehai Liang ◽

Qicong Ying ◽

Benjamin Chu

Keyword(s):

Capillary Electrophoresis ◽

Dna Sequencing ◽

High Performance ◽

Sequencing Analysis ◽

Interpenetrating Network ◽

Dna Sequencing Analysis

Dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology

GigaScience ◽

10.1093/gigascience/giaa135 ◽

2020 ◽

Vol 9 (12) ◽

Author(s):

Christina Weißbecker ◽

Beatrix Schnabel ◽

Anna Heintz-Buschart

Keyword(s):

High Performance ◽

Expert Knowledge ◽

Amplicon Sequencing ◽

Marker Genes ◽

Sequencing Analysis ◽

Sequencing Data ◽

Rna Sequences ◽

Hand Off ◽

Sequencing Platforms ◽

Computational Resources

Background: Amplicon sequencing of phylogenetic marker genes, e.g., 16S, 18S, or ITS ribosomal RNA sequences, is still the most commonly used method to determine the composition of microbial communities. Microbial ecologists often have expert knowledge on their biological question and data analysis in general, and most research institutes have computational infrastructures to use the bioinformatics command line tools and workflows for amplicon sequencing analysis, but requirements of bioinformatics skills often limit the efficient and up-to-date use of computational resources. Results: We present dadasnake, a user-friendly, 1-command Snakemake pipeline that wraps the preprocessing of sequencing reads and the delineation of exact sequence variants by using the favorably benchmarked and widely used DADA2 algorithm with a taxonomic classification and the post-processing of the resultant tables, including hand-off in standard formats. The suitability of the provided default configurations is demonstrated using mock community data from bacteria and archaea, as well as fungi. Conclusions: By use of Snakemake, dadasnake makes efficient use of high-performance computing infrastructures. Easy user configuration guarantees flexibility of all steps, including the processing of data from multiple sequencing platforms. It is easy to install dadasnake via conda environments. dadasnake is available at https://github.com/a-h-b/dadasnake.

INSIDE INDUSTRY

Asia-Pacific Biotech News ◽

10.1142/s0219030316000434 ◽

2016 ◽

Vol 20 (06) ◽

pp. 45-53

Keyword(s):

Genome Analysis ◽

Liquid Biopsy ◽

High Performance ◽

Compressed Air ◽

Opioid Overdose ◽

Medical Group ◽

Private Healthcare ◽

Broad Institute ◽

Genome Analysis Toolkit ◽

Unit Dose

APTAR PHARMA Provides Unit-Dose Nasal Spray Technology for Treatment of Opioid Overdose. Cloudera, Broad Institute Collaborate on the Next Generation of the Genome Analysis Toolkit. Singapore-based Luye Medical Group Completes Acquisition of Healthe Care, Australia's Third Largest Private Healthcare Group. FEI Launches Apreo – Industry-Leading Versatile, High-Performance SEM. BOGE Publishes New Guide on Specifying Compressed Air for Healthcare. Takara Bio USA, Inc. and Integrated DNA Technologies Announce Collaboration to Support Targeted RNA Sequencing. Pelican BioThermal Announces Launch of New Asia Headquarters in Singapore. A Faster Way to Separate Proteins with Electrophoresis. Biosensors Announces Strategic Agreement with Cardinal Health. BGI and Clearbridge BioMedics Partner to Develop China CTC Liquid Biopsy Market towards Precision Medicine.

Scale-up development of high-performance polymer matrix for DNA sequencing analysis

Electrophoresis ◽

10.1002/elps.200600299 ◽

2006 ◽

Vol 27 (19) ◽

pp. 3712-3723 ◽

Cited By ~ 6

Author(s):

Fen Wan ◽

Weidong He ◽

Jun Zhang ◽

Qicong Ying ◽

Benjamin Chu

Keyword(s):

Dna Sequencing ◽

Polymer Matrix ◽

High Performance ◽

Scale Up ◽

Sequencing Analysis ◽

High Performance Polymer ◽

Dna Sequencing Analysis

dadasnake, a Snakemake implementation of DADA2 to process amplicon sequencing data for microbial ecology

10.1101/2020.05.17.095679 ◽

2020 ◽

Author(s):

Christina Weißbecker ◽

Beatrix Schnabel ◽

Anna Heintz-Buschart

Keyword(s):

High Performance ◽

Expert Knowledge ◽

Amplicon Sequencing ◽

Marker Genes ◽

Sequencing Analysis ◽

Sequencing Data ◽

Hand Off ◽

Sequencing Platforms ◽

Computational Resources ◽

User Friendly

Background: Amplicon sequencing of phylogenetic marker genes, e.g. 16S, 18S or ITS rRNA sequences, is still the most commonly used method to determine the composition of microbial communities. Microbial ecologists often have expert knowledge on their biological question and data analysis in general, and most research institutes have computational infrastructures to employ the bioinformatics command line tools and workflows for amplicon sequencing analysis, but requirements of bioinformatics skills often limit the efficient and up-to-date use of computational resources. Results: dadasnake wraps pre-processing of sequencing reads, delineation of exact sequence variants using the favorably benchmarked, widely used DADA2 algorithm, taxonomic classification and post-processing of the resultant tables, and hand-off in standard formats, into a user-friendly, one-command Snakemake pipeline. The suitability of the provided default configurations is demonstrated using mock-community data from bacteria and archaea, as well as fungi. Conclusions: By use of Snakemake, dadasnake makes efficient use of high-performance computing infrastructures. Easy user configuration guarantees flexibility of all steps, including the processing of data from multiple sequencing platforms. dadasnake is easy to install via conda environments. dadasnake is available at https://github.com/a-h-b/dadasnake.

Computational performance and accuracy of Sentieon DNASeq variant calling workflow

10.1101/396325 ◽

2018 ◽

Cited By ~ 3

Author(s):

Katherine I. Kendig ◽

Saurabh Baheti ◽

Matthew A. Bockol ◽

Travis M. Drucker ◽

Steven N. Hart ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Analysis ◽

Variant Calling ◽

Single Sample ◽

Computational Performance ◽

Software Packages ◽

Broad Institute ◽

Accepted Standard ◽

Genome Analysis Toolkit ◽

Alternative Solutions

As reliable, efficient genome sequencing becomes more ubiquitous, the need for similarly reliable and efficient variant calling becomes increasingly important. The Genome Analysis Toolkit (GATK), maintained by the Broad Institute, is currently the widely accepted standard for variant calling software. However, alternative solutions may provide faster variant calling without sacrificing accuracy. One such alternative is Sentieon DNASeq, a toolkit analogous to GATK but built on a highly optimized backend. We evaluated the DNASeq single-sample variant calling pipeline in comparison to that of GATK. Our results confirm the near-identical accuracy of the two software packages, showcase perfect scalability and great speed from Sentieon, and describe computational performance considerations for the deployment of Sentieon DNASeq.
