SparkRA: Enabling Big Data Scalability for the GATK RNA-seq Pipeline with Apache Spark

Genes ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 53
Author(s):  
Zaid Al-Ars ◽  
Saiyi Wang ◽  
Hamid Mushtaq

The rapid proliferation of low-cost RNA-seq data has resulted in growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating the discoveries of other analyses. However, many practical applications in this field are limited by the available computational resources and the associated long computing times. GATK offers a popular best practices pipeline specifically designed for variant calling in RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis to multiple processors or compute nodes efficiently, limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline that efficiently scales up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a 32 GB dataset. In contrast, SparkRA reduces the overall computation time of the pipeline on the same single node by about 4×, down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA further reduces this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of results.
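The core pattern behind this kind of scaling is scatter-gather: split the input reads into chunks, run a pipeline stage on each chunk in parallel, and merge the per-chunk results. The sketch below illustrates the idea with the Python standard library only; it is not SparkRA's actual code (SparkRA distributes work via Apache Spark), and the quality-filter stage is a stand-in for real alignment and variant-calling steps.

```python
# Scatter-gather sketch of chunked parallel processing, stdlib only.
from concurrent.futures import ThreadPoolExecutor

def call_variants_in_chunk(reads):
    # Stand-in for a per-chunk pipeline stage: count reads that pass
    # a mock base-quality filter.
    return sum(1 for r in reads if r["quality"] >= 30)

def scatter(reads, n_chunks):
    # Split the read list into n_chunks roughly equal chunks.
    return [reads[i::n_chunks] for i in range(n_chunks)]

reads = [{"quality": q} for q in (10, 35, 40, 20, 31, 29, 50, 33)]
chunks = scatter(reads, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    partial = list(pool.map(call_variants_in_chunk, chunks))
total = sum(partial)  # gather/merge step
```

In Spark the scatter step corresponds to partitioning an RDD or DataFrame and the gather step to a reduce/aggregate action; the point is that each chunk can be processed independently, which is what lets the pipeline scale across cores and nodes.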

2013 ◽  
Vol 2013 ◽  
pp. 1-8 ◽  
Author(s):  
Shanrong Zhao ◽  
Kurt Prenger ◽  
Lance Smith

RNA-Seq is becoming a promising replacement for microarrays in transcriptome profiling and differential gene expression studies. Technical improvements have decreased sequencing costs and, as a result, the size and number of RNA-Seq datasets have increased rapidly. However, the growing volume of data from large-scale RNA-Seq studies poses a practical challenge for data analysis in a local environment. To meet this challenge, we developed Stormbow, a cloud-based software package, to process large volumes of RNA-Seq data in parallel. The performance of Stormbow has been tested by applying it to analyse 178 RNA-Seq samples in the cloud. In our test, it took 6 to 8 hours to process an RNA-Seq sample with 100 million reads, at an average cost of $3.50 per sample. Using Amazon Web Services as the infrastructure for Stormbow allows us to easily scale up to handle large datasets with on-demand computational resources. Stormbow is a scalable, cost-effective, open-source tool for large-scale RNA-Seq data analysis. It can be freely downloaded and used out of the box to process Illumina RNA-Seq datasets.
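The figures quoted above give a sense of the scale involved; a back-of-the-envelope calculation (using only the numbers reported in the abstract) shows the total cost of the test and the serial runtime that parallel cloud execution avoids:

```python
# Back-of-the-envelope figures from the Stormbow test described above.
samples = 178
cost_per_sample = 3.50          # average reported cost, USD
hours_per_sample_low = 6        # lower bound of reported runtime

total_cost = samples * cost_per_sample        # total cloud cost, USD
sequential_low_h = samples * hours_per_sample_low  # hours if run serially
```

Even at the 6-hour lower bound, serial processing would take over a thousand hours, which is the practical motivation for running the samples in parallel.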


2017 ◽  
Vol 2 ◽  
pp. 6 ◽  
Author(s):  
Laura Oikkonen ◽  
Stefano Lise

Identifying variants from RNA-seq (transcriptome sequencing) data is a cost-effective and versatile alternative to whole-genome sequencing. However, current variant callers do not generally behave well with RNA-seq data due to reads encompassing intronic regions. We have developed a software programme called Opossum to address this problem. Opossum pre-processes RNA-seq reads prior to variant calling, and although it has been designed to work specifically with Platypus, it can be used equally well with other variant callers such as GATK HaplotypeCaller. In this work, we show that using Opossum in conjunction with either Platypus or GATK HaplotypeCaller maintains precision and improves the sensitivity for SNP detection compared to the GATK Best Practices pipeline. In addition, using it in combination with Platypus offers a substantial reduction in run times compared to the GATK pipeline so it is ideal when there are only limited time or computational resources available.
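The workflow described is a simple two-stage chain: pre-process the RNA-seq BAM with Opossum, then hand the cleaned BAM to the variant caller. The sketch below only builds the command lines for such a chain; the flag names are illustrative placeholders, not Opossum's or Platypus's actual CLI options.

```python
# Hypothetical two-stage pipeline builder: pre-processor, then caller.
# Flag names are placeholders for illustration only.
def build_pipeline(bam_in, ref_fasta, vcf_out):
    preprocessed = bam_in.replace(".bam", ".opossum.bam")
    preprocess = ["opossum", "--bam", bam_in, "--out", preprocessed]
    call = ["platypus", "callVariants",
            "--bamFiles", preprocessed,
            "--refFile", ref_fasta,
            "--output", vcf_out]
    return [preprocess, call]

cmds = build_pipeline("sample.bam", "hg38.fa", "sample.vcf")
```

In a real run each command list would be passed to something like `subprocess.run`, with the second stage executed only after the first succeeds, since the caller consumes the pre-processor's output file.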


2019 ◽  
Vol 255 ◽  
pp. 01003 ◽  
Author(s):  
Jameel Syed Muslim ◽  
Hashmani Manzoor Ahmed ◽  
Rizvi Syed Sajjad Hussain ◽  
Uddin Vali ◽  
Rehman Mobashar

The cutting-edge technology of Machine Learning (ML) is successfully applied to Business Intelligence. Among the various pre-processing steps of ML, Automatic Image Annotation (also known as automatic image tagging or linguistic indexing) is the process by which a computer system automatically assigns metadata, in the form of captions or keywords, to a digital image. Automatic Image Annotation (AIA) methods, which have appeared over the last several years, make extensive use of many ML approaches; clustering and classification methods are most frequently applied to annotate images. In addition, these proposed solutions require substantial computational infrastructure. However, certain real-time applications (small and ad-hoc intelligent applications), for example, small autonomous robots, gadgets, and drones, have limited computational processing capacity. These small and ad-hoc applications demand a more dynamic and portable way to automatically annotate data and then perform ML tasks (classification, clustering, etc.) in real time using limited computational power and hardware resources. Through a comprehensive literature study, we found that most image pre-processing algorithms and ML tasks are computationally intensive, and it can be challenging to run them on an embedded platform at acceptable frame rates. However, a Raspberry Pi is sufficient for the AIA and ML tasks relevant to small and ad-hoc intelligent applications, whereas a few critical intelligent applications (which require high computational resources, for example, deep learning on huge datasets) are only feasible on more powerful hardware.
In this study, we present the framework of "Automatic Image Annotation for Small and Ad-hoc Intelligent Applications using Raspberry Pi" and propose low-cost infrastructures (single-node and multi-node, using Raspberry Pi) and a software module (for the Raspberry Pi) to perform AIA and ML tasks in real time for small and ad-hoc intelligent applications. Integrating both AIA and ML tasks in a single software module (within the Raspberry Pi) is challenging. This study will help improve various practical application areas relevant to small intelligent autonomous systems.


Author(s):  
William C. Regli ◽  
Satyandra K. Gupta ◽  
Dana S. Nau

Abstract The availability of low-cost computational power is enabling development of increasingly sophisticated CAD software. Automation of design and manufacturing activities poses many difficult computational problems. Design is an interactive process and speed is a critical factor in systems that enable designers to explore and experiment with alternative ideas. As more downstream manufacturing activities are considered during the design phase, computational costs become problematic. Achieving interactivity requires a sophisticated allocation of computational resources in order to perform realistic design analyses and generate feedback in real time. This paper presents our initial efforts to use distributed algorithms to recognize machining features from solid models of parts with large numbers of features and many geometric and topological entities. Our goal is to outline how significant improvements in computation time can be obtained using existing hardware and software tools. An implementation of our approach is discussed.


2018 ◽  
Author(s):  
Travis Oenning ◽  
Taejeong Bae ◽  
Aravind Iyengar ◽  
Barrett Brickner ◽  
Madushanka Soysa ◽  
...  

The application of assembly methods to personal genome analysis from next-generation sequencing data has been limited by the requirement for expensive supercomputer hardware or long computation times on ordinary resources. We describe CompStor Novos, which achieves supercomputer-class performance in de novo assembly computation time on standard server hardware, based on a tiered-memory algorithm. Run on commercial off-the-shelf servers, Novos assembly is more precise and 10-20 times faster than existing assembly algorithms. Furthermore, we integrated Novos into a variant calling pipeline and demonstrate that both the compute times and the precision of calling point variants and indels compare well with standard alignment-based pipelines. Additionally, assembly eliminates bias in the estimation of allele frequency for indels and naturally enables the discovery of breakpoints for structural variants with base-pair resolution. Thus, Novos bridges the gap between alignment-based and assembly-based genome analyses. Extension and adaptation of its underlying algorithm will help quickly and fully harvest the information in sequencing reads for personal genome reconstruction.




2016 ◽  
Author(s):  
Andrian Yang ◽  
Michael Troup ◽  
Peijie Lin ◽  
Joshua W. K. Ho

Abstract
Summary: Single-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework that enables parallelisation of existing RNA-seq processing pipelines using the big data technologies Apache Hadoop and Apache Spark to perform massively parallel analysis of large-scale transcriptomic data. Using two public scRNA-seq datasets and two popular RNA-seq alignment/feature quantification pipelines, we show that the same processing pipeline runs 2.6–145.4 times faster using Falco than on a highly optimised single-node analysis. Falco also allows users to utilise low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in the cost of analysis.
Availability: Falco is available via a GNU General Public License at https://github.com/VCCRI/Falco/
Contact: [email protected]
Supplementary information: Supplementary data are available at bioRxiv online.


GigaScience ◽  
2021 ◽  
Vol 10 (9) ◽  
Author(s):  
Tanveer Ahmad ◽  
Zaid Al-Ars ◽  
H Peter Hofstee

Abstract Background Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational cost. Therefore, there is a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely integrate existing single-node pre-processing and variant-calling applications. Using Apache Spark just for distributing/scheduling data among loosely coupled applications, or using I/O-based storage for the output of intermediate applications, does not exploit the full benefit of Apache Spark's in-memory processing. To exploit this benefit, we propose a native Spark-based workflow that uses Python and Apache Arrow to enable efficient transfer of data between different workflow stages. This benefits from the ease of programmability of Python and the high efficiency of Arrow's columnar in-memory data transformations. Results Here we present a scalable, parallel, and efficient implementation of next-generation sequencing data pre-processing and variant-calling workflows. Our design tightly integrates most pre-processing workflow stages, using Spark built-in functions to sort reads by coordinates and mark duplicates efficiently. Our approach outperforms state-of-the-art implementations by >2 times for the pre-processing stages, creating a scalable and high-performance solution for DeepVariant on both CPU-only and CPU + GPU clusters. Conclusions We show the feasibility and easy scalability of our approach to achieve high performance and efficient resource utilization for variant-calling analysis on high-performance computing clusters using the standardized Apache Arrow data representations. 
All codes, scripts, and configurations used to run our implementations are publicly available and open sourced; see https://github.com/abs-tudelft/variant-calling-at-scale.
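The efficiency argument for Arrow rests on its columnar layout: each field is stored as one contiguous, typed buffer, so a per-column transformation scans a single dense array instead of hopping across records. The stdlib sketch below contrasts the two layouts; it is an illustration of the layout idea only, not the authors' Arrow-based code.

```python
# Row-oriented vs columnar layout, illustrated with the stdlib.
from array import array

# Row-oriented: one dict per read record; a per-field operation must
# first gather the field out of every record.
rows = [{"pos": p, "qual": q} for p, q in [(100, 30), (250, 42), (300, 17)]]
row_quals = [r["qual"] for r in rows]

# Columnar: each field is one contiguous, typed buffer (as in Arrow).
pos = array("i", [100, 250, 300])
qual = array("i", [30, 42, 17])

# A column transformation now operates on one dense buffer directly.
mean_qual = sum(qual) / len(qual)
```

In the actual workflow, Arrow additionally lets the Python stages hand such buffers between processes without serialization, which is where the in-memory transfer savings come from.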


2016 ◽  
Author(s):  
Arnald Alonso ◽  
Brittany N. Lasseigne ◽  
Kelly Williams ◽  
Josh Nielsen ◽  
Ryne C. Ramaker ◽  
...  

Abstract
Summary: The wide range of RNA-seq applications and their high computational needs require the development of pipelines orchestrating the entire workflow and optimizing the usage of available computational resources. We present aRNApipe, a project-oriented pipeline for processing RNA-seq data in high-performance cluster environments. aRNApipe is highly modular and can be easily migrated to any high-performance computing (HPC) environment. The current applications included in aRNApipe combine the essential RNA-seq primary analyses, including quality control metrics, transcript alignment, count generation, transcript fusion identification, alternative splicing, and sequence variant calling. aRNApipe is project-oriented and dynamic, so users can easily update analyses to include or exclude samples or enable additional processing modules. Workflow parameters are easily set using a single configuration file that provides centralized tracking of all analytical processes. Finally, aRNApipe incorporates interactive web reports for sample tracking and a tool for managing the genome assemblies available to perform an analysis.
Availability and documentation: https://github.com/HudsonAlpha/aRNAPipe; DOI: 10.5281/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
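A single configuration file driving module selection, as described above, can be handled with a plain INI-style parser. The sketch below is a hypothetical illustration; the section and key names are placeholders, not aRNApipe's actual schema.

```python
# Hypothetical single-file pipeline configuration; section and key
# names are illustrative placeholders, not aRNApipe's schema.
import configparser

cfg_text = """
[project]
name = demo_rnaseq
genome = GRCh38

[modules]
quality_control = on
alignment = on
variant_calling = off
"""

cfg = configparser.ConfigParser()
cfg.read_string(cfg_text)

# The pipeline driver would enable only the modules switched on here,
# which is how users include/exclude processing steps per project.
enabled = [m for m, v in cfg["modules"].items() if v == "on"]
```

Keeping all parameters in one file like this is what makes centralized tracking possible: the file itself is a record of exactly which analyses were run and with which settings.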


GigaScience ◽  
2022 ◽  
Vol 11 (1) ◽  
Author(s):  
Dries Decap ◽  
Louise de Schaetzen van Brienen ◽  
Maarten Larmuseau ◽  
Pascal Costanza ◽  
Charlotte Herzeel ◽  
...  

Abstract Background The accurate detection of somatic variants from sequencing data is of key importance for cancer treatment and research. Somatic variant calling requires a high sequencing depth of the tumor sample, especially when the detection of low-frequency variants is also desired. In turn, this leads to large volumes of raw sequencing data to process and hence, large computational requirements. For example, calling the somatic variants according to the GATK best practices guidelines requires days of computing time for a typical whole-genome sequencing sample. Findings We introduce Halvade Somatic, a framework for somatic variant calling from DNA sequencing data that takes advantage of multi-node and/or multi-core compute platforms to reduce runtime. It relies on Apache Spark to provide scalable I/O and to create and manage data streams that are processed on different CPU cores in parallel. Halvade Somatic contains all required steps to process the tumor and matched normal sample according to the GATK best practices recommendations: read alignment (BWA), sorting of reads, preprocessing steps such as marking duplicate reads and base quality score recalibration (GATK), and, finally, calling the somatic variants (Mutect2). Our approach reduces the runtime on a single 36-core node to 19.5 h compared to a runtime of 84.5 h for the original pipeline, a speedup of 4.3 times. Runtime can be further decreased by scaling to multiple nodes, e.g., we observe a runtime of 1.36 h using 16 nodes, an additional speedup of 14.4 times. Halvade Somatic supports variant calling from both whole-genome sequencing and whole-exome sequencing data and also supports Strelka2 as an alternative or complementary variant calling tool. We provide a Docker image to facilitate single-node deployment. Halvade Somatic can be executed on a variety of compute platforms, including Amazon EC2 and Google Cloud. 
Conclusions To our knowledge, Halvade Somatic is the first somatic variant calling pipeline that leverages Big Data processing platforms and provides reliable, scalable performance. Source code is freely available.
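The reported speedups follow directly from the quoted runtimes; a quick check with the rounded figures reproduces them (the cluster step comes out at ~14.3× from the rounded hours versus the quoted 14.4×, presumably computed from unrounded runtimes):

```python
# Speedup arithmetic from the runtimes quoted in the abstract.
original_h = 84.5      # original GATK best-practices pipeline
single_node_h = 19.5   # Halvade Somatic on one 36-core node
cluster_h = 1.36       # Halvade Somatic on 16 nodes

single_node_speedup = original_h / single_node_h  # ~4.3x
cluster_speedup = single_node_h / cluster_h       # ~14.3x
end_to_end = original_h / cluster_h               # ~62x overall
```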

