GenPipes: an open-source framework for distributed and scalable genomic analyses

2018 ◽  
Author(s):  
Mathieu Bourgey ◽  
Rola Dali ◽  
Robert Eveleigh ◽  
Kuang Chung Chen ◽  
Louis Letourneau ◽  
...  

Abstract
With the decreasing cost of sequencing and the rapid developments in genomics technologies and protocols, the need for validated bioinformatics software that enables efficient large-scale data processing is growing. Here we present GenPipes, a flexible Python-based framework that facilitates the development and deployment of multi-step workflows optimized for High Performance Computing clusters and the cloud. GenPipes already implements 12 validated and scalable pipelines for various genomics applications, including RNA-Seq, ChIP-Seq, DNA-Seq, Methyl-Seq, Hi-C, capture Hi-C, metagenomics and PacBio long-read assembly. The software is available under a GPLv3 open source license and is continuously updated to follow recent advances in genomics and bioinformatics. The framework has already been configured on several servers, and a Docker image is also available to facilitate additional installations. In summary, GenPipes offers genomic researchers a simple method to analyze different types of data, customizable to their needs and resources, as well as the flexibility to create their own workflows.
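
GenPipes' actual API is documented in its repository; purely as a hedged illustration of the multi-step-workflow idea described in the abstract, the sketch below chains dependent steps and renders them as SLURM submissions. The `Step`/`render_slurm` names and the placeholder commands are assumptions for illustration, not GenPipes code.

```python
# Minimal sketch of a dependency-aware multi-step workflow (illustrative
# only; this is NOT GenPipes' actual API). Each step declares a shell
# command and its upstream steps; the pipeline is rendered as SLURM
# submissions chained with --dependency=afterok.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    command: str
    depends_on: List["Step"] = field(default_factory=list)


def render_slurm(steps: List[Step]) -> None:
    """Print sbatch commands that respect declared step dependencies."""
    job_vars = {}  # step name -> shell variable holding its job id
    for i, step in enumerate(steps):
        deps = ":".join(f"${job_vars[d.name]}" for d in step.depends_on)
        dep_flag = f" --dependency=afterok:{deps}" if deps else ""
        var = f"JOB{i}"
        job_vars[step.name] = var
        print(f'{var}=$(sbatch --parsable{dep_flag} --wrap "{step.command}")')


trim = Step("trim", "trim_reads sample_R1.fq sample_R2.fq")      # placeholder command
align = Step("align", "align_reads ref.fa trimmed_R1.fq trimmed_R2.fq",
             depends_on=[trim])
render_slurm([trim, align])
```

Declaring steps as data rather than running them directly is what lets a single pipeline definition target either a cluster scheduler or a cloud backend, as the abstract describes.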

Author(s):  
Gordon Bell ◽  
David H Bailey ◽  
Jack Dongarra ◽  
Alan H Karp ◽  
Kevin Walsh

The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.


2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Ying-Chih Lin ◽  
Chin-Sheng Yu ◽  
Yen-Jen Lin

Recent progress in high-throughput instrumentation has led to astonishing growth in both the volume and the complexity of biomedical data collected from various sources. Data at this planetary scale poses serious challenges to storage and computing technologies. Cloud computing is a promising alternative because it addresses storage and high-performance computing on large-scale data at the same time. This work briefly introduces data-intensive computing systems and summarizes existing cloud-based resources in bioinformatics. These developments and applications should help biomedical researchers make the vast amount of diverse data meaningful and usable.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Bingzheng Li ◽  
Jinchen Xu ◽  
Zijing Liu

With the development of high-performance computing and big data applications, the scale of data transmitted, stored, and processed by high-performance computing cluster systems is increasing explosively. Efficiently compressing large-scale data to reduce the space required for its storage and transmission is one of the keys to improving the performance of such cluster systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA for the Sunway SW26010 heterogeneous many-core processor. Guided by the characteristics of the SW26010 processor, we analyse the storage space requirements, memory access patterns, and hotspot functions of the LZMA algorithm and implement thread-level parallelism based on the Athread interface. Furthermore, we apply a fine-grained layout of the LDM address space to realize a DMA double-buffered cyclic sliding-window scheme, which further optimizes the performance of SW-LZMA. The experimental results show that, compared with the serial baseline implementation of LZMA, the parallel algorithm achieves a maximum speedup of 4.1x on the Silesia corpus benchmark and of 5.3x on a large-scale dataset.
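
The Sunway-specific Athread/DMA code is not shown in the abstract; as a hedged, host-side illustration of the double-buffering idea only, the Python sketch below prefetches the next chunk of input while the current chunk is compressed, so transfer and compute overlap. The chunk size and file handling are placeholder assumptions, not the SW-LZMA implementation.

```python
# Illustrative double-buffering sketch (plain Python, not SW26010/Athread
# code): fetch the next chunk in the background while compressing the
# current one, mimicking the DMA double-buffer overlap described above.
import lzma
from concurrent.futures import ThreadPoolExecutor


def read_chunks(path, chunk_size=1 << 20):
    """Yield fixed-size chunks of the input file (1 MiB by default)."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


def compress_double_buffered(path):
    chunks = read_chunks(path)
    compressed = []
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(next, chunks, None)       # prefetch first chunk
        while (current := pending.result()) is not None:
            pending = prefetcher.submit(next, chunks, None)   # fetch next chunk...
            compressed.append(lzma.compress(current))         # ...while compressing this one
    return b"".join(compressed)
```

Splitting the input into independently compressed blocks trades a little compression ratio for parallelism, which is the same trade a chunked sliding-window design makes.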


2018 ◽  
Author(s):  
LM Simon ◽  
S Karg ◽  
AJ Westermann ◽  
M Engel ◽  
AHA Elbehery ◽  
...  

Abstract
Background: With the advent of the age of big data in bioinformatics, large volumes of data and high-performance computing power enable researchers to re-analyze publicly available datasets at an unprecedented scale. Ever more studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts, but its generic nature also enables the detection of microbial and viral transcripts.
Findings: We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating the outcomes of 6 independent controlled infection experiments in cell line models and by comparison with an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from >17,000 samples from >400 studies relevant to human disease using state-of-the-art high-performance computing systems. The resulting data of this large-scale re-analysis are made available in the presented MetaMap resource.
Conclusions: Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for hypothesis generation towards the role of the microbiome in human disease.
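
As a small, hedged sketch of the re-inspection step (not the authors' pipeline), the snippet below uses the pysam library to pull reads that failed to map against the human reference out of a BAM file, ready to be screened against microbial and viral references; the file names are placeholders.

```python
# Hedged sketch: extract the non-human-mapping read fraction from a BAM
# file so it can be screened against microbial/viral references.
# "sample.bam" and the output path are placeholders.
import pysam

with pysam.AlignmentFile("sample.bam", "rb") as bam, \
        open("nonhuman_reads.fasta", "w") as out:
    # until_eof=True streams every record, including unmapped reads
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped and read.query_sequence:
            out.write(f">{read.query_name}\n{read.query_sequence}\n")
```

The resulting FASTA of unmapped reads is the kind of input one would hand to a metagenomic classifier or mapper, which is the comparison the authors draw against an alternative metatranscriptomic mapping strategy.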


2018 ◽  
Vol 106 (4) ◽  
Author(s):  
Jean-Paul Courneya ◽  
Alexa Mayo

Despite having an ideal setup in their labs for wet work, researchers often lack the computational infrastructure to analyze the volume of data that results from “-omics” experiments. In this innovative project, the library supports analysis of high-throughput data from global molecular profiling experiments by offering a high-performance computer with open source software along with expert bioinformationist support. The audience for this new service is faculty, staff, and students for whom using the university’s large-scale CORE computational resources is not warranted because these resources exceed the needs of smaller projects. In the library’s approach, users are empowered to analyze high-throughput data that they otherwise would not be able to on their own computers. To develop the project, the library’s bioinformationist identified the ideal computing hardware and a group of open source bioinformatics software to provide analysis options for experimental data such as scientific images, sequence reads, and flow cytometry files. To close the loop between learning and practice, the bioinformationist developed self-guided learning materials, workshops, and consultations on topics such as the National Center for Biotechnology Information’s BLAST, Bioinformatics on the Cloud, and ImageJ. Researchers apply the data analysis techniques that they learned in the classroom in an ideal computing environment.


2019 ◽  
Vol 35 (21) ◽  
pp. 4493-4495 ◽  
Author(s):  
Michael Schubert

Abstract
Motivation: High-performance computing (HPC) clusters play a pivotal role in large-scale bioinformatics analysis and modeling. For the statistical computing language R, packages exist that enable a user to submit their analyses as jobs on HPC schedulers. However, these packages do not scale well to high numbers of tasks, and their processing overhead quickly becomes a prohibitive bottleneck.
Results: Here we present clustermq, an R package that can process analyses up to three orders of magnitude faster than previously published alternatives. We show this for investigating genomic associations of drug sensitivity in cancer cell lines, but it can be applied to any kind of parallelizable workflow.
Availability and implementation: The package is available on CRAN and at https://github.com/mschubert/clustermq. Code for performance testing is available at https://github.com/mschubert/clustermq-performance.
Supplementary information: Supplementary data are available at Bioinformatics online.
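
clustermq itself is an R package; as a language-agnostic illustration (written in Python here, to match the other sketches) of why this style of dispatch scales, the snippet below reuses a few persistent workers that pull thousands of small tasks from a shared pool, so the per-task overhead is a queue hand-off rather than a scheduler round-trip per task. The function and task names are invented for the example.

```python
# Illustration (not clustermq): persistent workers consume many small
# tasks, amortizing startup cost across the whole workload instead of
# paying a scheduler submission per task.
from multiprocessing import Pool


def fit_association(task):
    drug, gene = task
    # placeholder for the real per-task work (e.g., fitting one model)
    return drug, gene, (drug * 31 + gene) % 97


if __name__ == "__main__":
    tasks = [(d, g) for d in range(100) for g in range(100)]  # 10,000 tasks
    with Pool(processes=8) as pool:  # 8 persistent workers, reused for all tasks
        results = pool.map(fit_association, tasks, chunksize=100)
    print(len(results), "associations computed")
```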


Author(s):  
Balraj Singh ◽  
Harsh K Verma

Background: The extreme growth of data necessitates high-performance computing. MapReduce is among the most sought-after platforms for processing large-scale data. Analysis of the existing system has revealed its performance bottlenecks and areas of concern: MapReduce suffers severely from skew and load imbalance across its processing nodes. Objective: This paper proposes a novel technique for MapReduce that lowers the skew of Map tasks and improves load balance. It reduces job execution time by lowering the completion time of the slowest task. Method: The proposed method performs a one-time settlement of load balancing among the Map tasks by analyzing their expected completion times and redistributing the load. It uses intervals to migrate overloaded or slow tasks, appending them to underloaded tasks or free slots. Result: Experiments on different datasets show an improvement of up to 1.3x over the relevant existing techniques. Conclusion: A significant improvement in performance is observed as a result of the lower job completion time. The proposed technique exhibits reduced skew and a uniform distribution of load among Map nodes.
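
As a toy, hedged sketch of the one-time settlement idea (not the paper's implementation), the function below takes estimated completion times for each Map task and greedily migrates load from the slowest to the least-loaded task until the spread falls under a threshold; all numbers and names are illustrative.

```python
# Toy load-rebalancing sketch: job time is dominated by the slowest Map
# task (the straggler), so shrinking the spread of per-task loads directly
# shrinks job completion time. Illustrative only.
def rebalance(loads, threshold=1.1, max_rounds=100):
    """loads: estimated completion time per Map task (arbitrary units)."""
    loads = list(loads)
    for _ in range(max_rounds):
        hi = max(range(len(loads)), key=loads.__getitem__)  # slowest task
        lo = min(range(len(loads)), key=loads.__getitem__)  # least loaded
        if loads[hi] <= threshold * loads[lo]:
            break                      # spread acceptable; stop migrating
        move = (loads[hi] - loads[lo]) / 2  # migrate half the gap
        loads[hi] -= move
        loads[lo] += move
    return loads


print(rebalance([40, 10, 12, 9]))  # the 40-unit straggler dominated the job
```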

