GenPipes: an open-source framework for distributed and scalable genomic analyses

2018 ◽  
Author(s):  
Mathieu Bourgey ◽  
Rola Dali ◽  
Robert Eveleigh ◽  
Kuang Chung Chen ◽  
Louis Letourneau ◽  
...  

Abstract
With the decreasing cost of sequencing and the rapid developments in genomics technologies and protocols, the need for validated bioinformatics software that enables efficient large-scale data processing is growing. Here we present GenPipes, a flexible Python-based framework that facilitates the development and deployment of multi-step workflows optimized for High Performance Computing clusters and the cloud. GenPipes already implements 12 validated and scalable pipelines for various genomics applications, including RNA-Seq, ChIP-Seq, DNA-Seq, Methyl-Seq, Hi-C, capture Hi-C, metagenomics and PacBio long-read assembly. The software is available under a GPLv3 open source license and is continuously updated to follow recent advances in genomics and bioinformatics. The framework has already been configured on several servers, and a Docker image is also available to facilitate additional installations. In summary, GenPipes offers genomic researchers a simple method to analyze different types of data, customizable to their needs and resources, as well as the flexibility to create their own workflows.
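
GenPipes' actual API is documented in its repository; purely as a hedged illustration of the multi-step-workflow idea described in the abstract, the sketch below chains dependent steps and renders them as SLURM submissions. The `Step`/`render_slurm` names and the placeholder commands are assumptions for illustration, not GenPipes code.

```python
# Minimal sketch of a dependency-aware multi-step workflow (illustrative
# only; this is NOT GenPipes' actual API). Each step declares a shell
# command and its upstream steps; the pipeline is rendered as SLURM
# submissions chained with --dependency=afterok.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    name: str
    command: str
    depends_on: List["Step"] = field(default_factory=list)


def render_slurm(steps: List[Step]) -> None:
    """Print sbatch commands that respect declared step dependencies."""
    job_vars = {}  # step name -> shell variable holding its job id
    for i, step in enumerate(steps):
        deps = ":".join(f"${job_vars[d.name]}" for d in step.depends_on)
        dep_flag = f" --dependency=afterok:{deps}" if deps else ""
        var = f"JOB{i}"
        job_vars[step.name] = var
        print(f'{var}=$(sbatch --parsable{dep_flag} --wrap "{step.command}")')


trim = Step("trim", "trim_reads sample_R1.fq sample_R2.fq")      # placeholder command
align = Step("align", "align_reads ref.fa trimmed_R1.fq trimmed_R2.fq",
             depends_on=[trim])
render_slurm([trim, align])
```

Declaring steps as data rather than running them directly is what lets a single pipeline definition target either a cluster scheduler or a cloud backend, as the abstract describes.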

Author(s):  
Gordon Bell ◽  
David H Bailey ◽  
Jack Dongarra ◽  
Alan H Karp ◽  
Kevin Walsh

The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.


2013 ◽  
Vol 2013 ◽  
pp. 1-6 ◽  
Author(s):  
Ying-Chih Lin ◽  
Chin-Sheng Yu ◽  
Yen-Jen Lin

Recent progress in high-throughput instrumentation has led to astonishing growth in both the volume and the complexity of biomedical data collected from various sources. Data at this planetary scale poses serious challenges to storage and computing technologies. Cloud computing is a promising alternative because it addresses storage and high-performance computing on large-scale data at the same time. This work briefly introduces data-intensive computing systems and summarizes existing cloud-based resources in bioinformatics. These developments and applications should help biomedical researchers make the vast amount of diverse data meaningful and usable.


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Bingzheng Li ◽  
Jinchen Xu ◽  
Zijing Liu

With the development of high-performance computing and big data applications, the scale of data transmitted, stored, and processed by high-performance computing cluster systems is increasing explosively. Efficiently compressing large-scale data to reduce the space required for its storage and transmission is one of the keys to improving the performance of such cluster systems. In this paper, we present SW-LZMA, a parallel design and optimization of LZMA for the Sunway SW26010 heterogeneous many-core processor. Guided by the characteristics of the SW26010 processor, we analyse the storage space requirements, memory access patterns, and hotspot functions of the LZMA algorithm and implement thread-level parallelism based on the Athread interface. Furthermore, we apply a fine-grained layout of the LDM address space to realize a DMA double-buffered cyclic sliding-window scheme, which further optimizes the performance of SW-LZMA. The experimental results show that, compared with the serial baseline implementation of LZMA, the parallel algorithm achieves a maximum speedup of 4.1x on the Silesia corpus benchmark and of 5.3x on a large-scale dataset.
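
The Sunway-specific Athread/DMA code is not shown in the abstract; as a hedged, host-side illustration of the double-buffering idea only, the Python sketch below prefetches the next chunk of input while the current chunk is compressed, so transfer and compute overlap. The chunk size and file handling are placeholder assumptions, not the SW-LZMA implementation.

```python
# Illustrative double-buffering sketch (plain Python, not SW26010/Athread
# code): fetch the next chunk in the background while compressing the
# current one, mimicking the DMA double-buffer overlap described above.
import lzma
from concurrent.futures import ThreadPoolExecutor


def read_chunks(path, chunk_size=1 << 20):
    """Yield fixed-size chunks of the input file (1 MiB by default)."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk


def compress_double_buffered(path):
    chunks = read_chunks(path)
    compressed = []
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(next, chunks, None)       # prefetch first chunk
        while (current := pending.result()) is not None:
            pending = prefetcher.submit(next, chunks, None)   # fetch next chunk...
            compressed.append(lzma.compress(current))         # ...while compressing this one
    return b"".join(compressed)
```

Splitting the input into independently compressed blocks trades a little compression ratio for parallelism, which is the same trade a chunked sliding-window design makes.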


2018 ◽  
Author(s):  
LM Simon ◽  
S Karg ◽  
AJ Westermann ◽  
M Engel ◽  
AHA Elbehery ◽  
...  

Abstract
Background: With the advent of the age of big data in bioinformatics, large volumes of data and high-performance computing power enable researchers to re-analyze publicly available datasets at an unprecedented scale. Ever more studies implicate the microbiome in both normal human physiology and a wide range of diseases. RNA sequencing technology (RNA-seq) is commonly used to infer global eukaryotic gene expression patterns under defined conditions, including human disease-related contexts, but its generic nature also enables the detection of microbial and viral transcripts.
Findings: We developed a bioinformatic pipeline to screen existing human RNA-seq datasets for the presence of microbial and viral reads by re-inspecting the non-human-mapping read fraction. We validated this approach by recapitulating the outcomes of 6 independent controlled infection experiments in cell line models and by comparison with an alternative metatranscriptomic mapping strategy. We then applied the pipeline to close to 150 terabytes of publicly available raw RNA-seq data from >17,000 samples from >400 studies relevant to human disease using state-of-the-art high-performance computing systems. The resulting data of this large-scale re-analysis are made available in the presented MetaMap resource.
Conclusions: Our results demonstrate that common human RNA-seq data, including those archived in public repositories, might contain valuable information to correlate microbial and viral detection patterns with diverse diseases. The presented MetaMap database thus provides a rich resource for hypothesis generation towards the role of the microbiome in human disease.
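
As a small, hedged sketch of the re-inspection step (not the authors' pipeline), the snippet below uses the pysam library to pull reads that failed to map against the human reference out of a BAM file, ready to be screened against microbial and viral references; the file names are placeholders.

```python
# Hedged sketch: extract the non-human-mapping read fraction from a BAM
# file so it can be screened against microbial/viral references.
# "sample.bam" and the output path are placeholders.
import pysam

with pysam.AlignmentFile("sample.bam", "rb") as bam, \
        open("nonhuman_reads.fasta", "w") as out:
    # until_eof=True streams every record, including unmapped reads
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped and read.query_sequence:
            out.write(f">{read.query_name}\n{read.query_sequence}\n")
```

The resulting FASTA of unmapped reads is the kind of input one would hand to a metagenomic classifier or mapper, which is the comparison the authors draw against an alternative metatranscriptomic mapping strategy.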


2018 ◽  
Vol 106 (4) ◽  
Author(s):  
Jean-Paul Courneya ◽  
Alexa Mayo

Despite having an ideal setup in their labs for wet work, researchers often lack the computational infrastructure to analyze the volume of data that results from “-omics” experiments. In this innovative project, the library supports analysis of high-throughput data from global molecular profiling experiments by offering a high-performance computer with open source software along with expert bioinformationist support. The audience for this new service is faculty, staff, and students for whom using the university’s large-scale CORE computational resources is not warranted because these resources exceed the needs of smaller projects. In the library’s approach, users are empowered to analyze high-throughput data that they otherwise would not be able to on their own computers. To develop the project, the library’s bioinformationist identified the ideal computing hardware and a group of open source bioinformatics software to provide analysis options for experimental data such as scientific images, sequence reads, and flow cytometry files. To close the loop between learning and practice, the bioinformationist developed self-guided learning materials, workshops, and consultations on topics such as the National Center for Biotechnology Information’s BLAST, Bioinformatics on the Cloud, and ImageJ. Researchers apply the data analysis techniques that they learned in the classroom in an ideal computing environment.


2019 ◽  
Vol 35 (21) ◽  
pp. 4493-4495 ◽  
Author(s):  
Michael Schubert

Abstract
Motivation: High-performance computing (HPC) clusters play a pivotal role in large-scale bioinformatics analysis and modeling. For the statistical computing language R, packages exist that enable a user to submit their analyses as jobs on HPC schedulers. However, these packages do not scale well to high numbers of tasks, and their processing overhead quickly becomes a prohibitive bottleneck.
Results: Here we present clustermq, an R package that can process analyses up to three orders of magnitude faster than previously published alternatives. We show this for investigating genomic associations of drug sensitivity in cancer cell lines, but it can be applied to any kind of parallelizable workflow.
Availability and implementation: The package is available on CRAN and at https://github.com/mschubert/clustermq. Code for performance testing is available at https://github.com/mschubert/clustermq-performance.
Supplementary information: Supplementary data are available at Bioinformatics online.
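
clustermq itself is an R package; as a language-agnostic illustration (written in Python here, to match the other sketches) of why this style of dispatch scales, the snippet below reuses a few persistent workers that pull thousands of small tasks from a shared pool, so the per-task overhead is a queue hand-off rather than a scheduler round-trip per task. The function and task names are invented for the example.

```python
# Illustration (not clustermq): persistent workers consume many small
# tasks, amortizing startup cost across the whole workload instead of
# paying a scheduler submission per task.
from multiprocessing import Pool


def fit_association(task):
    drug, gene = task
    # placeholder for the real per-task work (e.g., fitting one model)
    return drug, gene, (drug * 31 + gene) % 97


if __name__ == "__main__":
    tasks = [(d, g) for d in range(100) for g in range(100)]  # 10,000 tasks
    with Pool(processes=8) as pool:  # 8 persistent workers, reused for all tasks
        results = pool.map(fit_association, tasks, chunksize=100)
    print(len(results), "associations computed")
```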


Author(s):  
Balraj Singh ◽  
Harsh K Verma

Background: The extreme growth of data necessitates high-performance computing. MapReduce is among the most sought-after platforms for processing large-scale data. Analysis of the existing system has revealed its performance bottlenecks and areas of concern: MapReduce suffers severely from skew and load imbalance across its processing nodes. Objective: This paper proposes a novel technique for MapReduce that lowers the skew of Map tasks and improves load balance. It reduces job execution time by lowering the completion time of the slowest task. Method: The proposed method performs a one-time settlement of load balancing among the Map tasks by analyzing their expected completion times and redistributing the load. It uses intervals to migrate overloaded or slow tasks, appending them to underloaded tasks or free slots. Result: Experiments on different datasets show an improvement of up to 1.3x over the relevant existing techniques. Conclusion: A significant improvement in performance is observed as a result of the lower job completion time. The proposed technique exhibits reduced skew and a uniform distribution of load among Map nodes.
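
As a toy, hedged sketch of the one-time settlement idea (not the paper's implementation), the function below takes estimated completion times for each Map task and greedily migrates load from the slowest to the least-loaded task until the spread falls under a threshold; all numbers and names are illustrative.

```python
# Toy load-rebalancing sketch: job time is dominated by the slowest Map
# task (the straggler), so shrinking the spread of per-task loads directly
# shrinks job completion time. Illustrative only.
def rebalance(loads, threshold=1.1, max_rounds=100):
    """loads: estimated completion time per Map task (arbitrary units)."""
    loads = list(loads)
    for _ in range(max_rounds):
        hi = max(range(len(loads)), key=loads.__getitem__)  # slowest task
        lo = min(range(len(loads)), key=loads.__getitem__)  # least loaded
        if loads[hi] <= threshold * loads[lo]:
            break                      # spread acceptable; stop migrating
        move = (loads[hi] - loads[lo]) / 2  # migrate half the gap
        loads[hi] -= move
        loads[lo] += move
    return loads


print(rebalance([40, 10, 12, 9]))  # the 40-unit straggler dominated the job
```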

