aRNApipe: A balanced, efficient and distributed pipeline for processing RNA-seq data in high performance computing environments

2016 ◽  
Author(s):  
Arnald Alonso ◽  
Brittany N. Lasseigne ◽  
Kelly Williams ◽  
Josh Nielsen ◽  
Ryne C. Ramaker ◽  
...  

Abstract
Summary: The wide range of RNA-seq applications and their high computational needs require the development of pipelines orchestrating the entire workflow and optimizing usage of available computational resources. We present aRNApipe, a project-oriented pipeline for processing of RNA-seq data in high performance cluster environments. aRNApipe is highly modular and can be easily migrated to any high performance computing (HPC) environment. The current applications included in aRNApipe combine the essential RNA-seq primary analyses, including quality control metrics, transcript alignment, count generation, transcript fusion identification, alternative splicing, and sequence variant calling. aRNApipe is project-oriented and dynamic so users can easily update analyses to include or exclude samples or enable additional processing modules. Workflow parameters are easily set using a single configuration file that provides centralized tracking of all analytical processes. Finally, aRNApipe incorporates interactive web reports for sample tracking and a tool for managing the genome assemblies available to perform an analysis.
Availability and documentation: https://github.com/HudsonAlpha/aRNAPipe; DOI:10.5281/[email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
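As a purely illustrative sketch of the project-oriented idea described above (a single configuration file driving which samples and analysis modules run), the following Python snippet parses a hypothetical INI-style project file. The keys, module names, and file layout are assumptions for illustration and do not reproduce aRNApipe's actual configuration format.

```python
# Hypothetical sketch of a project-oriented configuration, not aRNApipe's
# actual file format: a single INI-style file centralizes samples, genome
# assembly, and the set of enabled analysis modules.
import configparser

EXAMPLE_CONFIG = """
[project]
name = my_rnaseq_project
genome = GRCh38

[samples]
sample_01 = reads/sample_01_R1.fastq.gz,reads/sample_01_R2.fastq.gz
sample_02 = reads/sample_02_R1.fastq.gz,reads/sample_02_R2.fastq.gz

[modules]
quality_control = yes
alignment = yes
counts = yes
fusion = no
splicing = no
variants = no
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

enabled = [m for m in config["modules"] if config["modules"].getboolean(m)]
for sample, fastqs in config["samples"].items():
    # In a real pipeline each (sample, module) pair would be submitted as an
    # HPC job; here we only print the execution plan.
    print(sample, fastqs.split(","), "->", enabled)
```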

2018 ◽  
Author(s):  
LM Simon ◽  
S Karg ◽  
AJ Westermann ◽  
M Engel ◽  
AHA Elbehery ◽  
...  

Abstract
Background: With the advent of the age of big data in bioinformatics, large volumes of data and high performance computing power enable researchers to perform re-analyses of publicly available datasets at an unprecedented scale. Ever more studies imply the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts, but its generic nature also enables the detection of microbial and viral transcripts.
Findings: We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating outcomes from 6 independent controlled infection experiments of cell line models and comparison with an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from >17,000 samples from >400 studies relevant to human disease using state-of-the-art high performance computing systems. The resulting data of this large-scale re-analysis are made available in the presented MetaMap resource.
Conclusions: Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for hypothesis generation towards the role of the microbiome in human disease.
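The screening idea described in this abstract, re-inspecting the fraction of reads that do not map to the human genome, can be sketched as follows. The file names are hypothetical and the snippet only illustrates the general approach, not the authors' exact MetaMap pipeline.

```python
# Minimal sketch of the screening idea behind MetaMap: keep only reads that
# did NOT map to the human genome and hand them to a metagenomic classifier.
# Paths and the downstream classifier are illustrative assumptions.
import subprocess

human_bam = "sample.human_aligned.bam"   # alignment against the human genome
unmapped_fastq = "sample.nonhuman.fastq"

# samtools flag 4 selects reads that are unmapped in the human alignment.
subprocess.run(
    f"samtools view -b -f 4 {human_bam} | samtools fastq - > {unmapped_fastq}",
    shell=True, check=True,
)

# The non-human read fraction can then be classified against microbial and
# viral reference databases (e.g., with a k-mer based classifier).
print(f"Non-human reads written to {unmapped_fastq}")
```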


Author(s):  
Atta ur Rehman Khan ◽  
Abdul Nasir Khan

Mobile devices are gaining popularity due to their support for a wide range of applications. However, mobile devices are resource constrained, and many applications require substantial resources. To address this issue, researchers envision the use of mobile cloud computing technology, which offers high performance computing, execution of resource-intensive applications, and energy efficiency. This chapter highlights the importance of mobile devices, high performance applications, and the computing challenges of mobile devices. It also provides a brief introduction to mobile cloud computing technology, its architecture, types of mobile applications, the computation offloading process, effective offloading challenges, and high performance computing applications on mobile devices that are enabled by mobile cloud computing technology.
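A minimal sketch of the computation offloading decision mentioned above: a task is offloaded when the estimated transfer-plus-cloud execution cost is lower than local execution. All speeds and task sizes below are assumed values chosen purely for illustration.

```python
# Illustrative sketch of a computation offloading decision for mobile cloud
# computing: offload a task when the estimated remote cost (transfer + cloud
# execution) is lower than executing it locally. All numbers are assumptions.

def should_offload(cycles, data_bytes,
                   local_speed=1e9,      # device CPU cycles per second
                   cloud_speed=10e9,     # cloud CPU cycles per second
                   bandwidth=1e6):       # uplink bytes per second
    local_time = cycles / local_speed
    remote_time = data_bytes / bandwidth + cycles / cloud_speed
    return remote_time < local_time

# A compute-heavy task with little data to transfer benefits from offloading;
# a data-heavy but light computation usually does not.
print(should_offload(cycles=5e9, data_bytes=1e5))   # True: compute-bound
print(should_offload(cycles=1e8, data_bytes=5e7))   # False: transfer-bound
```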


2015 ◽  
Author(s):  
Felipe Maciel ◽  
Carina Oliveira ◽  
Renato Juaçaba Neto ◽  
João Alencar ◽  
Paulo Rego ◽  
...  

In this paper, we propose a novel architecture that allows the implementation of a cyber environment composed of different High Performance Computing (HPC) infrastructures (i.e., clusters, grids, and clouds). To access this cyber environment, scientific researchers do not have to become computer experts. In particular, we assume that researchers provide a description of the problem as input to the cyber environment and then obtain their results without being responsible for managing the computational resources. We provide a prototype of the architecture and present an evaluation based on a real workload of scientific application executions. The results show the advantages of the proposed architecture. In addition, we highlight that this work provides guidelines for developing cyber environments focused on e-Science.
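A hypothetical sketch of the architecture's central idea, that researchers only submit a problem description while the cyber environment selects and manages the underlying HPC infrastructure. All class names and the backend-selection policy are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: the researcher submits only a problem description;
# the cyber environment picks a backend (cluster, grid, or cloud) and manages
# the execution. Class and field names are illustrative.
from dataclasses import dataclass

@dataclass
class ProblemDescription:
    application: str          # e.g., a registered scientific application
    input_files: list
    cores: int
    walltime_hours: float

def choose_backend(desc, backends):
    # Naive policy: first backend with enough free cores; a real scheduler
    # would also weigh queue length, data locality, and cost.
    for name, free_cores in backends.items():
        if free_cores >= desc.cores:
            return name
    return "cloud"  # elastic fallback

desc = ProblemDescription("blast", ["genome.fasta"], cores=64, walltime_hours=2.0)
backends = {"local_cluster": 32, "campus_grid": 128}
print(choose_backend(desc, backends))   # -> campus_grid
```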


2015 ◽  
Author(s):  
Pierre Carrier ◽  
Bill Long ◽  
Richard Walsh ◽  
Jef Dawson ◽  
Carlos P. Sosa ◽  
...  

High Performance Computing (HPC) Best Practice offers opportunities to implement lessons learned in areas such as computational chemistry and physics in genomics workflows, specifically Next-Generation Sequencing (NGS) workflows. In this study we will briefly describe how distributed-memory parallelism can be an important enhancement to the performance and resource utilization of NGS workflows. We will illustrate this point by showing results on the parallelization of the Inchworm module of the Trinity RNA-Seq pipeline for de novo transcriptome assembly. We show that these types of applications can scale to thousands of cores. Time scaling as well as memory scaling will be discussed at length using two RNA-Seq datasets, targeting the Mus musculus (mouse) and the Axolotl (Mexican salamander). Details about the efficient MPI communication and the impact on performance will also be shown. We hope to demonstrate that this type of parallelization approach can be extended to most types of bioinformatics workflows, with substantial benefits. The efficient, distributed-memory parallel implementation eliminates memory bottlenecks and dramatically accelerates NGS analysis. We further include a summary of programming paradigms available to the bioinformatics community, such as C++/MPI.
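The parallelization described above is implemented in C++/MPI; as a language-agnostic illustration of the same distributed-memory pattern, the sketch below distributes k-mer counting across MPI ranks using mpi4py. It is a toy example of the approach, not the Inchworm implementation.

```python
# Hedged sketch of distributed-memory k-mer counting with MPI (via mpi4py),
# illustrating the kind of parallelism described for Inchworm; this is not
# the Trinity/Inchworm implementation itself.
from collections import Counter
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Toy read set; in practice each rank would stream its own shard of the
# FASTQ input so no single node has to hold everything in memory.
reads = ["ACGTACGT", "CGTACGTT", "TTGCACGT", "ACGTTGCA"]
my_reads = reads[rank::size]

K = 4
local_counts = Counter()
for read in my_reads:
    for i in range(len(read) - K + 1):
        local_counts[read[i:i + K]] += 1

# Merge per-rank partial counts; a production code would shard k-mers by
# hash across ranks instead of reducing everything to rank 0.
all_counts = comm.gather(local_counts, root=0)
if rank == 0:
    total = sum(all_counts, Counter())
    print(total.most_common(3))
```

Run with, for example, mpiexec -n 4 python kmer_count.py.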


2020 ◽  
Author(s):  
Ambarish Kumar ◽  
Ali Haider Bangash

Abstract
Genomics has emerged as one of the major sources of big data. The resulting data-driven challenges in bioinformatics can be met using parallel and distributed computing technologies. GATK4 tools for genomic variant detection are enabled for high-performance computing platforms through the Spark MapReduce framework. A GATK4 + WDL + Cromwell + Spark + Docker stack is proposed as the way forward in achieving automation, reproducibility, reusability, customization, portability and scalability. Spark-based tools perform as well in genomic variant detection as the standard command-line implementations of the GATK4 tools. Implementation of workflows on cloud-based high-performance computing platforms will enhance usability and will be a way forward in community research and infrastructure development for genomic variant discovery.
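As an illustration of the comparison described above, the snippet below builds a standard GATK4 command and its Spark-enabled counterpart. The tool names follow GATK4 conventions, but the exact flags and Spark options are assumptions that should be checked against the documentation of the installed GATK4 version.

```python
# Illustrative comparison of a standard GATK4 run and its Spark-enabled
# counterpart, wrapped in Python for scripting. Flags are indicative only
# and should be verified against the GATK4 docs for the installed version.
import subprocess

ref, bam, vcf = "ref.fasta", "sample.bam", "sample.vcf.gz"

# Standard single-node command-line invocation.
standard = ["gatk", "HaplotypeCaller", "-R", ref, "-I", bam, "-O", vcf]

# Spark-enabled invocation; here Spark runs locally, but the same tool can be
# pointed at a cluster's Spark master for distributed execution.
spark = ["gatk", "HaplotypeCallerSpark", "-R", ref, "-I", bam, "-O", vcf,
         "--spark-master", "local[8]"]

for cmd in (standard, spark):
    print(" ".join(cmd))   # in a real workflow: subprocess.run(cmd, check=True)
```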


GigaScience ◽  
2020 ◽  
Vol 9 (3) ◽  
Author(s):  
Haris Zafeiropoulos ◽  
Ha Quoc Viet ◽  
Katerina Vasileiadou ◽  
Antonis Potirakis ◽  
Christos Arvanitidis ◽  
...  

Abstract
Background: Environmental DNA and metabarcoding allow the identification of a mixture of species and launch a new era in bio- and eco-assessment. Many steps are required to obtain taxonomically assigned matrices from raw data. For most of these, a plethora of tools are available; each tool's execution parameters need to be tailored to reflect each experiment's idiosyncrasy. Adding to this complexity, the computation capacity of high-performance computing systems is frequently required for such analyses. To address the difficulties, bioinformatic pipelines need to combine state-of-the-art technologies and algorithms with an easy-to-get-set-use framework, allowing researchers to tune each study. Software containerization technologies ease the sharing and running of software packages across operating systems; thus, they strongly facilitate pipeline development and usage. Likewise, programming languages specialized for big data pipelines incorporate features like roll-back checkpoints and on-demand partial pipeline execution.
Findings: PEMA is a containerized assembly of key metabarcoding analysis tools that requires low effort in setting up, running, and customizing to researchers’ needs. Based on third-party tools, PEMA performs read pre-processing, (molecular) operational taxonomic unit clustering, amplicon sequence variant inference, and taxonomy assignment for 16S and 18S ribosomal RNA, as well as ITS and COI marker gene data. Owing to its simplified parameterization and checkpoint support, PEMA allows users to explore alternative algorithms for specific steps of the pipeline without the need of a complete re-execution. PEMA was evaluated against both mock communities and previously published datasets and achieved results of comparable quality.
Conclusions: A high-performance computing–based approach was used to develop PEMA; however, it can be used in personal computers as well. PEMA's time-efficient performance and good results will allow it to be used for accurate environmental DNA metabarcoding analysis, thus enhancing the applicability of next-generation biodiversity assessment studies.
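A minimal, generic sketch of the checkpoint idea that allows re-running only selected pipeline steps, as described above. It is not PEMA's actual checkpointing mechanism, and the step names are placeholders.

```python
# Minimal sketch of per-step checkpointing, the idea that lets a pipeline like
# PEMA skip already-completed steps and re-run only what changed; this is a
# generic illustration, not PEMA's actual checkpoint mechanism.
from pathlib import Path

CHECKPOINT_DIR = Path("checkpoints")
CHECKPOINT_DIR.mkdir(exist_ok=True)

def run_step(name, func):
    marker = CHECKPOINT_DIR / f"{name}.done"
    if marker.exists():
        print(f"[skip] {name} (checkpoint found)")
        return
    func()
    marker.touch()
    print(f"[done] {name}")

run_step("preprocessing", lambda: print("  trimming and filtering reads"))
run_step("clustering",    lambda: print("  clustering OTUs / inferring ASVs"))
run_step("taxonomy",      lambda: print("  assigning taxonomy"))
```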


Geosciences ◽  
2021 ◽  
Vol 11 (2) ◽  
pp. 72
Author(s):  
Muhammad Rizwan Riaz ◽  
Hiroki Motoyama ◽  
Muneo Hori

Recent achievements of research on soil-structure interaction (SSI) are reviewed, with a main focus on numerical analysis. The review is based on continuum mechanics theory and the use of high-performance computing (HPC), and it clarifies the characteristics of a wide range of treatments of SSI, from simplified models to high-fidelity models. It is emphasized that all of these treatments can be regarded as the result of mathematical approximations in solving a physical continuum mechanics problem of a soil-structure system. The use of HPC is inevitable if a solution of higher accuracy and finer resolution is needed. An example of using HPC for the analysis of SSI is presented.


Author(s):  
Yaser Jararweh ◽  
Moath Jarrah ◽  
Abdelkader Bousselham

Current state-of-the-art GPU-based systems offer unprecedented performance advantages by accelerating the most compute-intensive portions of applications by an order of magnitude. GPU computing presents a viable solution to the ever-increasing complexity of applications and the growing demand for immense computational resources. In this paper the authors investigate different platforms of GPU-based systems, ranging from Personal Supercomputing (PSC) to cloud-based GPU systems. They explore and evaluate these GPU-based platforms and present a comparison against conventional high performance cluster-based computing systems. Their evaluation shows the potential advantages of using GPU-based systems for high performance computing applications while meeting different scaling granularities.
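As a small, hedged illustration of accelerating the most compute-intensive portion of an application on a GPU, the snippet below offloads a matrix multiplication to a CUDA device using CuPy as a NumPy-compatible array library. It is a toy stand-in for the GPU platforms evaluated in the paper and requires a CUDA-capable GPU.

```python
# Illustrative sketch of moving a compute-intensive kernel to a GPU, using
# CuPy as a NumPy-compatible array library; a toy stand-in for real workloads.
import numpy as np
import cupy as cp

n = 2048
a_cpu = np.random.rand(n, n).astype(np.float32)
b_cpu = np.random.rand(n, n).astype(np.float32)

# CPU reference.
c_cpu = a_cpu @ b_cpu

# Same computation on the GPU: copy in, compute, copy out.
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)
c_gpu = cp.asnumpy(a_gpu @ b_gpu)

print("max abs difference:", float(np.max(np.abs(c_cpu - c_gpu))))
```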


2020 ◽  
Author(s):  
Maria Moreno de Castro ◽  
Stephan Kindermann ◽  
Sandro Fiore ◽  
Paola Nassisi ◽  
Guillaume Levavasseur ◽  
...  

Earth System observational and model data volumes are constantly increasing, and it can be challenging to discover, download, and analyze data if scientists do not have the required computing and storage resources at hand. This is especially the case for detection and attribution studies in the field of climate change research, since we need to perform multi-source and cross-disciplinary comparisons for datasets of high spatial and large temporal coverage. Researchers and end-users are therefore looking for access to cloud solutions and high performance computing facilities. The Earth System Grid Federation (ESGF, https://esgf.llnl.gov/) maintains a global system of federated data centers that allow access to the largest archive of model climate data worldwide. ESGF portals provide free access to the output of the data contributing to the next assessment report of the Intergovernmental Panel on Climate Change through the Coupled Model Intercomparison Project. In order to support users in directly accessing high performance computing facilities to perform analyses such as detection and attribution of climate change and its impacts, the EU Commission funded a new service within the infrastructure of the European Network for Earth System Modelling (ENES, https://portal.enes.org/data/data-metadata-service/analysis-platforms). This new service is designed to reduce data transfer issues, speed up the computational analysis, provide storage, and ensure resource access and maintenance. Furthermore, the service is free of charge and only requires a lightweight application. We will present a demo of how flexibly climate indices can be calculated from different ESGF datasets covering a wide range of temporal and spatial scales, using cdo (Climate Data Operators, https://code.mpimet.mpg.de/projects/cdo/) and Jupyter notebooks running directly on the ENES partners' high performance computing centers: DKRZ (Germany), JASMIN (UK), CMCC (Italy), and IPSL (France).
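As a small example of the kind of analysis described, the snippet below computes an annual, area-averaged time series from a CMIP-style NetCDF file with cdo, the way one might do it in a Jupyter notebook on one of the ENES analysis platforms. The input file name is hypothetical, while yearmean and fldmean are standard cdo operators.

```python
# Compute an annual, area-averaged surface air temperature time series with
# cdo: chain two operators (spatial field mean, then annual mean). The input
# path is a hypothetical CMIP-style file name.
import subprocess

infile = "tas_Amon_MODEL_historical_r1i1p1f1_gn_185001-201412.nc"  # hypothetical
outfile = "tas_annual_global_mean.nc"

subprocess.run(["cdo", "yearmean", "-fldmean", infile, outfile], check=True)
print(f"Wrote {outfile}")
```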

