Bioinformatics pipeline using JUDI: Just Do It!

2019 ◽  
Vol 36 (8) ◽  
pp. 2572-2574
Author(s):  
Soumitra Pal ◽  
Teresa M. Przytycka

Abstract Summary: Large-scale data analysis in bioinformatics requires the pipelined execution of multiple software tools. Each stage in a pipeline generally takes considerable computing resources, and several workflow management systems (WMS), e.g. Snakemake, Nextflow, Common Workflow Language and Galaxy, have been developed to ensure optimum execution of the stages across successive invocations of the pipeline. However, when the pipeline needs to be executed with different settings of parameters, e.g. thresholds, underlying algorithms, etc., these WMS require significant scripting to ensure an optimal execution. We developed JUDI on top of DoIt, a Python-based WMS, to systematically handle parameter settings based on the principles of database management systems. Using a novel modular approach that encapsulates a parameter database in each task and file associated with a pipeline stage, JUDI simplifies plug-and-play of the pipeline stages. For a typical pipeline with n parameters, JUDI reduces the number of lines of scripting required by a factor of O(n). With properly designed parameter databases, JUDI not only enables reproducing research under published values of parameters but also facilitates exploring new results under novel parameter settings. Availability and implementation: https://github.com/ncbi/JUDI Supplementary information: Supplementary data are available at Bioinformatics online.
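A minimal sketch of the parameter-database idea described in the abstract, written in plain Python (this is not JUDI's actual API; `param_db`, `expand_files` and the example parameter values are illustrative): each stage declares its parameters once, and the files belonging to the stage are expanded over the Cartesian product of the settings instead of being scripted per setting.

```python
from itertools import product
import pandas as pd

def param_db(**params):
    """One row per parameter setting: the Cartesian product of all values."""
    names, values = zip(*params.items())
    return pd.DataFrame(list(product(*values)), columns=list(names))

def expand_files(template, db):
    """Instantiate one file path per row of the parameter database."""
    return [template.format(**row) for row in db.to_dict("records")]

# Declared once, this covers every (threshold, algorithm) combination;
# adding a parameter means adding one keyword argument, not rewriting stages.
db = param_db(threshold=[0.01, 0.05], algorithm=["bwa", "bowtie2"])
targets = expand_files("results/aln_{algorithm}_t{threshold}.bam", db)
print(targets)   # four paths, one per parameter setting
```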

2019 ◽  
Author(s):  
Soumitra Pal ◽  
Teresa M. Przytycka

Abstract: Large-scale data analysis in bioinformatics requires executing several software tools in a pipelined fashion. Each stage in a pipeline generally takes considerable computing resources, and several workflow management systems (WMS), e.g. Snakemake and Nextflow, have been developed to ensure that, when the pipeline is invoked multiple times, only the bare minimum of stages affected by changes across invocations is executed. However, when the pipeline needs to be executed with different settings of parameters, e.g. thresholds, underlying algorithms, etc., these WMS require significant scripting to ensure an optimal execution of the pipeline. We developed JUDI on top of a Python-based WMS, DoIt, for systematic handling of pipeline parameter settings based on the principles of DBMS, which simplifies plug-and-play scripting. The effectiveness of JUDI is demonstrated in a pipeline for analyzing large-scale HT-SELEX data for transcription factor and DNA binding, where JUDI reduces scripting by a factor of five. Availability: https://github.com/ncbi/JUDI
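For contrast, a hedged sketch of the scripting burden the abstract refers to (the file names, the `score_binding` command and the HT-SELEX round numbers are invented for illustration, not taken from the paper): without a per-stage parameter database, every stage must enumerate its parameter combinations by hand.

```python
from itertools import product

# Hand-written enumeration: every stage repeats loops like this, and adding a
# new parameter means editing every loop in every stage of the script.
thresholds   = [0.01, 0.05]
algorithms   = ["bwa", "bowtie2"]
selex_rounds = [1, 2, 3, 4]        # hypothetical HT-SELEX cycles

jobs = []
for t, algo, r in product(thresholds, algorithms, selex_rounds):
    jobs.append({
        "input":   f"reads/round{r}.fastq",
        "output":  f"results/round{r}_{algo}_t{t}.tsv",
        "command": f"score_binding --algo {algo} --threshold {t}",
    })

print(len(jobs), "hand-enumerated jobs")   # 2 * 2 * 4 = 16
```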


2019 ◽  
Author(s):  
Zachary B. Abrams ◽  
Caitlin E. Coombes ◽  
Suli Li ◽  
Kevin R. Coombes

Abstract Summary: Unsupervised data analysis in many scientific disciplines is based on calculating distances between observations and finding ways to visualize those distances. These kinds of unsupervised analyses help researchers uncover patterns in large-scale data sets. However, researchers can select from a vast number of different distance metrics, each designed to highlight different aspects of different data types. There are also numerous visualization methods with their own strengths and weaknesses. To help researchers perform unsupervised analyses, we developed the Mercator R package. Mercator enables users to see important patterns in their data by generating multiple visualizations using different standard algorithms, making it particularly easy to compare and contrast the results arising from different metrics. By allowing users to select the distance metric that best fits their needs, Mercator helps researchers perform unsupervised analyses that use pattern identification through computation and visual inspection. Availability and Implementation: Mercator is freely available from the Comprehensive R Archive Network (https://cran.r-project.org/web/packages/Mercator/index.html). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
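The compare-and-contrast workflow described above can be sketched with generic Python tooling (scipy/scikit-learn here; this is a conceptual illustration, not the Mercator R package itself): compute pairwise distances under several candidate metrics, then feed each distance matrix to the same visualizations and compare the resulting pictures side by side.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 40)).astype(bool)    # toy binary data matrix

metrics = ["jaccard", "hamming", "sokalsneath"]        # candidate distance metrics
fig, axes = plt.subplots(2, len(metrics), figsize=(12, 7))

for j, metric in enumerate(metrics):
    d = pdist(X, metric=metric)                        # condensed pairwise distances

    # Visualization 1: hierarchical clustering of the same observations
    dendrogram(linkage(d, method="average"), ax=axes[0, j], no_labels=True)
    axes[0, j].set_title(f"{metric}: dendrogram")

    # Visualization 2: 2-D multidimensional scaling of the same distances
    emb = MDS(n_components=2, dissimilarity="precomputed",
              random_state=0).fit_transform(squareform(d))
    axes[1, j].scatter(emb[:, 0], emb[:, 1], s=10)
    axes[1, j].set_title(f"{metric}: MDS")

plt.tight_layout()
plt.show()
```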


2003 ◽  
Vol 12 (04) ◽  
pp. 411-440 ◽  
Author(s):  
Roberto Silveira Silva Filho ◽  
Jacques Wainer ◽  
Edmundo R. M. Madeira

Workflow management systems are usually designed as client-server systems. The central server is responsible for coordinating the workflow execution and, in some cases, may also manage the activities database. This centralized control architecture can represent a single point of failure, compromising the availability of the system. We propose a fully distributed and configurable architecture for workflow management systems. It is based on the idea that the activities of a case (an instance of the process) migrate from host to host, executing the workflow tasks while following a process plan. This core architecture is extended with additional distributed components so that other requirements for workflow management systems, besides scalability, are also addressed. The components of the architecture were tested in different distributed and centralized configurations. The ability to configure the location of components and the use of dynamic allocation of tasks proved effective for implementing load-balancing policies.
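A minimal sketch of the migrating-case idea under assumed names (`Case`, `execute_locally`, the example plan and tasks are all hypothetical, and the serialization and networking of a real deployment are omitted): the case carries its own process plan and partial results, and each host executes only the step assigned to it before forwarding the case onward, with no central coordinator.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    """A process instance that migrates from host to host, carrying its own plan."""
    case_id: str
    plan: list                   # ordered (host, task_name) pairs: the process plan
    step: int = 0
    results: dict = field(default_factory=dict)

# Hypothetical task implementations registered on every host.
TASKS = {
    "collect_form": lambda case: "form data",
    "approve":      lambda case: "approved: " + case.results["collect_form"],
    "archive":      lambda case: "stored",
}

def execute_locally(case, local_host):
    """Run the current step if it is assigned to this host; return the next host."""
    host, task_name = case.plan[case.step]
    assert host == local_host, "case was routed to the wrong host"
    case.results[task_name] = TASKS[task_name](case)
    case.step += 1
    return case.plan[case.step][0] if case.step < len(case.plan) else None

# Three steps hopping across three hosts; each hop would be a network transfer
# of the serialized case object in a real deployment.
case = Case("case-42", plan=[("hostA", "collect_form"),
                             ("hostB", "approve"),
                             ("hostC", "archive")])
next_host = "hostA"
while next_host is not None:
    next_host = execute_locally(case, next_host)
print(case.results)
```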


2006 ◽  
Vol 3 (1) ◽  
pp. 45-55
Author(s):  
P. Romano ◽  
G. Bertolini ◽  
F. De Paoli ◽  
M. Fattore ◽  
D. Marra ◽  
...  

Summary: The Human Genome Project has deeply transformed biology, and the field has since expanded to the management, processing, analysis and visualization of large quantities of data from genomics, proteomics, medicinal chemistry and drug screening. This huge amount of data, and the heterogeneity of the software tools in use, calls for the adoption on a very large scale of new, flexible tools that enable researchers to integrate data and analyses over the network. ICT standards and tools, such as Web Services and related languages, and workflow management systems can support the creation and deployment of such systems. While a number of Web Services are appearing, and personal workflow management systems are increasingly being offered to researchers, a reference portal enabling the vast majority of non-expert researchers to benefit from these new technologies is still lacking. In this paper, we introduce the rationale for the creation of such a portal and present the architecture and some preliminary results for the development of a portal for the enactment of workflows of interest in oncology.


2003 ◽  
Vol 34 (3) ◽  
pp. 40-47 ◽  
Author(s):  
Yuosre F. Badir ◽  
Rémi Founou ◽  
Claude Stricker ◽  
Vincent Bourquin

GigaScience ◽  
2020 ◽  
Vol 9 (6) ◽  
Author(s):  
Michael Kluge ◽  
Marie-Sophie Friedl ◽  
Amrei L Menzel ◽  
Caroline C Friedel

Abstract Background: Advances in high-throughput methods have brought new challenges for biological data analysis, often requiring many interdependent steps applied to a large number of samples. To address this challenge, workflow management systems, such as Watchdog, have been developed to support scientists in the (semi-)automated execution of large analysis workflows. Implementation: Here, we present Watchdog 2.0, which implements new developments for module creation, reusability, and documentation and for reproducibility of analyses and workflow execution. Developments include a graphical user interface for semi-automatic module creation from software help pages, sharing repositories for modules and workflows, and a standardized module documentation format. The latter allows generation of a customized reference book of public and user-specific modules. Furthermore, extensive logging of workflow execution, module and software versions, and explicit support for package managers and container virtualization now ensure reproducibility of results. A step-by-step analysis protocol generated from the log file may, e.g., serve as a draft of a manuscript methods section. Finally, 2 new execution modes were implemented. The first allows resuming workflow execution after interruption or modification without rerunning successfully executed tasks that are not affected by the changes. The second allows detaching from and reattaching to workflow execution on a local computer while tasks continue running on computer clusters. Conclusions: Watchdog 2.0 provides several new developments that we believe to be of benefit for large-scale bioinformatics analysis and that are not completely covered by other competing workflow management systems. The software itself, module and workflow repositories, and comprehensive documentation are freely available at https://www.bio.ifi.lmu.de/watchdog.
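The resume-after-interruption mode described above rests on a standard up-to-date check that can be sketched generically (this is not Watchdog's implementation; the state file, fingerprinting scheme and task dictionary layout are assumptions for illustration): record a fingerprint of each task's inputs and settings when it succeeds, and on the next run skip any task whose fingerprint is unchanged and whose outputs still exist.

```python
import hashlib, json, os, subprocess

STATE_FILE = "workflow_state.json"     # fingerprints of successfully finished tasks

def fingerprint(task):
    """Hash everything that should trigger a rerun: parameters, command, input files."""
    h = hashlib.sha256()
    h.update(json.dumps(task["params"], sort_keys=True).encode())
    h.update(task["command"].encode())
    for path in task["inputs"]:
        with open(path, "rb") as fh:
            h.update(fh.read())
    return h.hexdigest()

def run_workflow(tasks):
    """Tasks are dicts with name/params/command/inputs/outputs, topologically sorted."""
    state = json.load(open(STATE_FILE)) if os.path.exists(STATE_FILE) else {}
    for task in tasks:
        fp = fingerprint(task)
        outputs_exist = all(os.path.exists(p) for p in task["outputs"])
        if state.get(task["name"]) == fp and outputs_exist:
            print(f"skip {task['name']} (up to date)")
            continue
        print(f"run  {task['name']}")
        subprocess.run(task["command"], shell=True, check=True)
        state[task["name"]] = fp
        with open(STATE_FILE, "w") as fh:   # persist after every task so an
            json.dump(state, fh)            # interrupted run can resume from here
```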

