CGAT-core: a python framework for building scalable, reproducible computational biology workflows

In the genomics era, computational biologists regularly need to process, analyse and integrate large and complex biomedical datasets. Analysis inevitably involves multiple dependent steps, resulting in complex pipelines or workflows, often with several branches. Large data volumes mean that processing needs to be quick and efficient, and scientific rigour requires that analysis be consistent and fully reproducible. We have developed CGAT-core, a Python package for the rapid construction of complex computational workflows. CGAT-core seamlessly handles parallelisation across high-performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging. To illustrate our workflow framework, we present a pipeline for the analysis of RNA-seq data using pseudo-alignment.

10.1101/581009 ◽

2019 ◽

Author(s):

Adam Cribbs ◽

Sebastian Luna-Valero ◽

Charlotte George ◽

Ian M Sudbery ◽

Antonio J Berlanga-Taylor ◽

...

Keyword(s):

Computational Biology ◽

High Performance Computing ◽

High Performance ◽

Large Data ◽

Database Integration ◽

Scientific Rigour ◽

Rapid Construction ◽

Rnaseq Data ◽

Performance Computing ◽

Python Package

Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection

Conquering Big Data with High Performance Computing ◽

10.1007/978-3-319-33742-5_13 ◽

2016 ◽

pp. 269-286 ◽

Cited By ~ 2

Author(s):

Ritu Arora ◽

Jessica Trelogan ◽

Trung Nguyen Ba

Keyword(s):

Data Collection ◽

High Performance Computing ◽

High Performance ◽

Large Data ◽

Performance Computing

High-Performance Computing for Computational Biology

Biocomputing 2001 ◽

10.1142/9789814447362_0024 ◽

2000 ◽

Author(s):

Thomas Ferrin ◽

Bruce A. Foster ◽

Richard Hughey

Keyword(s):

Computational Biology ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

Applications of High Performance Computing in Bioinformatics, Computational Biology and Computational Chemistry

Bioinformatics and Biomedical Engineering - Lecture Notes in Computer Science ◽

10.1007/978-3-319-16480-9_51 ◽

2015 ◽

pp. 527-541 ◽

Cited By ~ 5

Author(s):

Horacio Pérez-Sánchez ◽

Afshin Fassihi ◽

José M. Cecilia ◽

Hesham H. Ali ◽

Mario Cannataro

Keyword(s):

Computational Biology ◽

Computational Chemistry ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

Artificial Intelligence: An Energy Efficiency Tool for Enhanced High Performance Computing

Symmetry ◽

10.3390/sym12061029 ◽

2020 ◽

Vol 12 (6) ◽

pp. 1029

Author(s):

Anabi Hilary Kelechi ◽

Mohammed H. Alsharif ◽

Okpe Jonah Bameyi ◽

Paul Joan Ezra ◽

Iorshase Kator Joseph ◽

...

Keyword(s):

Artificial Intelligence ◽

Energy Efficiency ◽

High Performance Computing ◽

High Performance ◽

Large Data ◽

Computing Systems ◽

Computing Power ◽

Product Delivery ◽

High Level ◽

Performance Computing

Power-consuming entities such as high performance computing (HPC) sites and large data centers are growing with advances in information technology. In business, HPC is used to shorten product delivery time, reduce production cost, and decrease the time it takes to develop a new product. Today’s high level of computing power from supercomputers comes at the expense of consuming large amounts of electric power. To minimize the energy used by HPC entities, it is necessary to reduce both the energy required by the computing systems and the resources needed to operate them. A database could improve system energy efficiency by sampling the power consumption of all components at regular intervals and storing that information. The stored information would serve as input data for energy-efficiency optimization. In addition, device workload information and different usage metrics are stored in the database. There has been strong momentum behind artificial intelligence (AI) as a tool for optimization and process automation that leverages already existing information. This paper discusses ideas for improving the energy efficiency of HPC using AI.

Accomplishments and Challenges in High Performance Computing for Computational Biology

Current Bioinformatics ◽

10.2174/157489306777011888 ◽

2006 ◽

Vol 1 (2) ◽

pp. 185-195

Author(s):

Zhihua Du ◽

Feng Lin ◽

Bertil Schmidt

Keyword(s):

Computational Biology ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

High performance computing for computational biology

10.32657/10356/47481 ◽

2005 ◽

Author(s):

Zhihua Du

Keyword(s):

Computational Biology ◽

High Performance Computing ◽

High Performance ◽

Performance Computing

EFMlrs: a Python package for elementary flux mode enumeration via lexicographic reverse search

BMC Bioinformatics ◽

10.1186/s12859-021-04417-9 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Bianca A Buchner ◽

Jürgen Zanghellini

Keyword(s):

High Performance Computing ◽

High Performance ◽

Metabolic Networks ◽

Stoichiometric Matrix ◽

Elementary Flux Mode ◽

Source Program ◽

Metabolic Models ◽

Performance Computing ◽

Python Package ◽

Flux Mode

Background: Elementary flux mode (EFM) analysis is a well-established, yet computationally challenging approach to characterize metabolic networks. Standard algorithms require huge amounts of memory and lack scalability, which limits their application to single servers and consequently limits a comprehensive analysis to medium-scale networks. Recently, Avis et al. developed mplrs, a parallel version of the lexicographic reverse search (lrs) algorithm, which, in principle, enables an EFM analysis on high-performance computing environments (Avis and Jordan. mplrs: a scalable parallel vertex/facet enumeration code. arXiv:1511.06487, 2017). Here we test its applicability for EFM enumeration. Results: We developed EFMlrs, a Python package that gives users access to the enumeration capabilities of mplrs. EFMlrs uses COBRApy to process metabolic models from SBML files, performs loss-free compressions of the stoichiometric matrix, and generates suitable inputs for mplrs as well as efmtool, providing support not only for our proposed new method for EFM enumeration but also for already established tools. By leveraging COBRApy, EFMlrs also allows the application of additional reaction boundaries and seamlessly integrates into existing workflows. Conclusion: We show that due to mplrs’s properties, the algorithm is perfectly suited for high-performance computing (HPC) and thus offers new possibilities for the unbiased analysis of substantially larger metabolic models via EFM analyses. EFMlrs is an open-source program that comes together with a designated workflow and can be easily installed via pip.

Parallelism in computational biology

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016677599 ◽

2016 ◽

Vol 32 (3) ◽

pp. 317-320 ◽

Cited By ~ 2

Author(s):

Miguel A Vega-Rodríguez ◽

Álvaro Rubio-Largo

Keyword(s):

Computational Biology ◽

High Performance Computing ◽

High Performance ◽

State Of The Art ◽

International Workshop ◽

Special Issue ◽

High Quality ◽

The Third ◽

Points Of View ◽

Performance Computing

Computational biology allows and encourages the application of many different parallelism-based technologies. This special issue brings together high-quality, state-of-the-art contributions on parallelism-based technologies in computational biology from different points of view, that is, from diverse high-performance computing applications. The special issue collects considerably extended and improved versions of the best papers accepted and presented at PBio 2015 (the Third International Workshop on Parallelism in Bioinformatics, part of IEEE ISPA 2015). The domains and topics covered in these seven papers are timely and important, and the authors have done an excellent job of presenting the material.
