scholarly journals CGAT-core: a python framework for building scalable, reproducible computational biology workflows

F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 377 ◽  
Author(s):  
Adam P. Cribbs ◽  
Sebastian Luna-Valero ◽  
Charlotte George ◽  
Ian M. Sudbery ◽  
Antonio J. Berlanga-Taylor ◽  
...  

In the genomics era computational biologists regularly need to process, analyse and integrate large and complex biomedical datasets. Analysis inevitably involves multiple dependent steps, resulting in complex pipelines or workflows, often with several branches. Large data volumes mean that processing needs to be quick and efficient and scientific rigour requires that analysis be consistent and fully reproducible. We have developed CGAT-core, a python package for the rapid construction of complex computational workflows. CGAT-core seamlessly handles parallelisation across high performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging. To illustrate our workflow framework, we present a pipeline for the analysis of RNAseq data using pseudo-alignment.

F1000Research ◽  
2019 ◽  
Vol 8 ◽  
pp. 377 ◽  
Author(s):  
Adam P. Cribbs ◽  
Sebastian Luna-Valero ◽  
Charlotte George ◽  
Ian M. Sudbery ◽  
Antonio J. Berlanga-Taylor ◽  
...  

In the genomics era computational biologists regularly need to process, analyse and integrate large and complex biomedical datasets. Analysis inevitably involves multiple dependent steps, resulting in complex pipelines or workflows, often with several branches. Large data volumes mean that processing needs to be quick and efficient and scientific rigour requires that analysis be consistent and fully reproducible. We have developed CGAT-core, a python package for the rapid construction of complex computational workflows. CGAT-core seamlessly handles parallelisation across high performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging. To illustrate our workflow framework, we present a pipeline for the analysis of RNAseq data using pseudo-alignment.


2019 ◽  
Author(s):  
Adam Cribbs ◽  
Sebastian Luna-Valero ◽  
Charlotte George ◽  
Ian M Sudbery ◽  
Antonio J Berlanga-Taylor ◽  
...  

In the genomics era computational biologists regularly need to process, analyse and integrate large and complex biomedical datasets. Analysis inevitably involves multiple dependent steps, resulting in complex pipelines or workflows, often with several branches. Large data volumes mean that processing needs to be quick and efficient and scientific rigour requires that analysis be consistent and fully reproducible. We have developed CGAT-core, a python package for the rapid construction of complex computational workflows. CGAT-core seamlessly handles parallelisation across high performance computing clusters, integration of Conda environments, full parameterisation, database integration and logging. To illustrate our workflow framework, we present a pipeline for the analysis of RNAseq data using pseudo-alignment.


Symmetry ◽  
2020 ◽  
Vol 12 (6) ◽  
pp. 1029
Author(s):  
Anabi Hilary Kelechi ◽  
Mohammed H. Alsharif ◽  
Okpe Jonah Bameyi ◽  
Paul Joan Ezra ◽  
Iorshase Kator Joseph ◽  
...  

Power-consuming entities such as high performance computing (HPC) sites and large data centers are growing with the advance in information technology. In business, HPC is used to enhance the product delivery time, reduce the production cost, and decrease the time it takes to develop a new product. Today’s high level of computing power from supercomputers comes at the expense of consuming large amounts of electric power. It is necessary to consider reducing the energy required by the computing systems and the resources needed to operate these computing systems to minimize the energy utilized by HPC entities. The database could improve system energy efficiency by sampling all the components’ power consumption at regular intervals and the information contained in a database. The information stored in the database will serve as input data for energy-efficiency optimization. More so, device workload information and different usage metrics are stored in the database. There has been strong momentum in the area of artificial intelligence (AI) as a tool for optimizing and processing automation by leveraging on already existing information. This paper discusses ideas for improving energy efficiency for HPC using AI.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Bianca A Buchner ◽  
Jürgen Zanghellini

Abstract Background Elementary flux mode (EFM) analysis is a well-established, yet computationally challenging approach to characterize metabolic networks. Standard algorithms require huge amounts of memory and lack scalability which limits their application to single servers and consequently limits a comprehensive analysis to medium-scale networks. Recently, Avis et al. developed —a parallel version of the lexicographic reverse search (lrs) algorithm, which, in principle, enables an EFM analysis on high-performance computing environments (Avis and Jordan. mplrs: a scalable parallel vertex/facet enumeration code. arXiv:1511.06487, 2017). Here we test its applicability for EFM enumeration. Results We developed , a Python package that gives users access to the enumeration capabilities of . uses COBRApy to process metabolic models from sbml files, performs loss-free compressions of the stoichiometric matrix, and generates suitable inputs for as well as , providing support not only for our proposed new method for EFM enumeration but also for already established tools. By leveraging COBRApy, also allows the application of additional reaction boundaries and seamlessly integrates into existing workflows. Conclusion We show that due to ’s properties, the algorithm is perfectly suited for high-performance computing (HPC) and thus offers new possibilities for the unbiased analysis of substantially larger metabolic models via EFM analyses. is an open-source program that comes together with a designated workflow and can be easily installed via pip.


Author(s):  
Miguel A Vega-Rodríguez ◽  
Álvaro Rubio-Largo

Computational biology allows and encourages the application of many different parallelism-based technologies. This special issue brings together high-quality state-of-the-art contributions about parallelism-based technologies in computational biology, from different points of view or perspectives, that is, from diverse high-performance computing applications. The special issue collects considerably extended and improved versions of the best papers, accepted and presented in PBio 2015 (the Third International Workshop on Parallelism in Bioinformatics, and part of IEEE ISPA 2015 ). The domains and topics covered in these seven papers are timely and important, and the authors have done an excellent job of presenting the material.


Sign in / Sign up

Export Citation Format

Share Document