BioWorkbench: a high-performance framework for managing and analyzing bioinformatics experiments

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5551 ◽  
Author(s):  
Maria Luiza Mondelli ◽  
Thiago Magalhães ◽  
Guilherme Loss ◽  
Michael Wilde ◽  
Ian Foster ◽  
...  

Advances in sequencing techniques have led to exponential growth in biological data, demanding the development of large-scale bioinformatics experiments. Because these experiments are computation- and data-intensive, they require high-performance computing techniques and can benefit from specialized technologies such as Scientific Workflow Management Systems and databases. In this work, we present BioWorkbench, a framework for managing and analyzing bioinformatics experiments. This framework automatically collects provenance data, including both performance data from workflow execution and data from the scientific domain of the workflow application. Provenance data can be analyzed through a web application that abstracts a set of queries to the provenance database, simplifying access to provenance information. We evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a RASopathy analysis workflow. We analyze each workflow from both computational and scientific domain perspectives, using queries to a provenance and annotation database. Some of these queries are available as a pre-built feature of the BioWorkbench web application. Through the provenance data, we show that the framework is scalable and achieves high performance, reducing the execution time of the case studies by up to 98%. We also show how the application of machine learning techniques can enrich the analysis process.
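
As a rough illustration of the kind of query such a web application might abstract, the sketch below loads a few execution records into an in-memory SQLite database and sums the run time per workflow activity. The table name `app_exec`, its columns, the function name, and the sample rows are all hypothetical; they are not taken from BioWorkbench's actual provenance schema.

```python
import sqlite3


def duration_per_activity(rows):
    """Return (activity, total seconds) pairs, longest-running first.

    rows: iterable of (activity, start_time, end_time) tuples.
    The schema is an assumption made for this sketch only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE app_exec (activity TEXT, start_time REAL, end_time REAL)"
    )
    conn.executemany("INSERT INTO app_exec VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT activity, SUM(end_time - start_time) AS total "
        "FROM app_exec GROUP BY activity ORDER BY total DESC"
    ).fetchall()
    conn.close()
    return result


# Hypothetical execution records: (activity, start, end) in seconds.
sample = [
    ("alignment", 0.0, 40.0),
    ("alignment", 5.0, 50.0),
    ("tree_build", 50.0, 80.0),
]
print(duration_per_activity(sample))
```

A pre-built query in a web front end would presumably wrap a statement like the `SELECT` above behind a form or chart, so that users do not need to write SQL themselves.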

2021 ◽  
Author(s):  
Vahid Arabnejad

Basic science is becoming ever more computationally intensive, increasing the need for large-scale compute and storage resources, be they within a High-Performance Computer cluster or, more recently, within the cloud. Commercial clouds have increasingly become a viable platform for hosting scientific analyses and computation due to their elasticity, recent introduction of specialist hardware, and pay-as-you-go cost model. This computing paradigm therefore presents a low-capital, low-barrier alternative to operating dedicated eScience infrastructure. Indeed, commercial clouds now enable universal access to capabilities previously available only to large, well-funded research groups. While the potential benefits of cloud computing are clear, there are still significant technical hurdles associated with obtaining the best execution efficiency whilst trading off cost. In most cases, large-scale scientific computation is represented as a workflow for scheduling and runtime provisioning. Such scheduling becomes an even more challenging problem on cloud systems due to the dynamic nature of the cloud, in particular the elasticity, the pricing models (both static and dynamic), the non-homogeneous resource types and the vast array of services. This mapping of workflow tasks onto a set of provisioned instances is an example of the general scheduling problem and is NP-complete. In addition, certain runtime constraints must be met, the most typical being the cost of the computation and the time that the computation requires to complete. This thesis addresses 'the scientific workflow scheduling problem in cloud', which is to schedule workflow tasks on cloud resources in a way that users meet their defined constraints such as budget and deadline, and providers maximize profits and resource utilization. Moreover, it explores different mechanisms and strategies for distributing defined constraints over a workflow and investigates their impact on the overall cost of the resulting schedule.
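
One simple way to picture "distributing a constraint over a workflow" is to split a global deadline into sub-deadlines per level of the task graph, in proportion to the longest estimated task at each level. The sketch below does only that. It is an illustrative heuristic, not necessarily any strategy studied in the thesis, and the function name, task names, and runtimes are hypothetical.

```python
from collections import defaultdict


def distribute_deadline(runtimes, parents, deadline):
    """Assign each task a sub-deadline derived from a global deadline.

    runtimes: task -> estimated runtime (at least one must be positive)
    parents:  task -> list of parent tasks (missing or empty for entry tasks)
    Tasks are grouped into levels by their longest path from an entry task;
    each level receives a share of the deadline proportional to the longest
    task in that level.
    """
    level = {}

    def depth(task):
        # Entry tasks get level 0; others sit one level below their deepest parent.
        if task not in level:
            level[task] = 1 + max(
                (depth(p) for p in parents.get(task, [])), default=-1
            )
        return level[task]

    for task in runtimes:
        depth(task)

    # Longest estimated runtime found in each level.
    level_time = defaultdict(float)
    for task, lvl in level.items():
        level_time[lvl] = max(level_time[lvl], runtimes[task])

    total = sum(level_time.values())
    sub_deadline = {}
    elapsed = 0.0
    for lvl in sorted(level_time):
        elapsed += deadline * level_time[lvl] / total
        sub_deadline[lvl] = elapsed
    return {task: sub_deadline[lvl] for task, lvl in level.items()}


# Hypothetical diamond-shaped workflow: a -> (b, c) -> d
runtimes = {"a": 10, "b": 20, "c": 30, "d": 10}
parents = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
print(distribute_deadline(runtimes, parents, deadline=100))
```

A scheduler could then choose, for each task, the cheapest resource type that still finishes before the task's sub-deadline; a budget could be distributed in the same proportional way.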


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Hanjing Jiang ◽  
Yabing Huang

Background: Drug-disease associations (DDAs) can provide important information for exploring the potential efficacy of drugs. However, few DDAs have been verified experimentally so far. Previous evidence indicates that combining information sources is conducive to the discovery of new DDAs. How to integrate different biological data sources and identify the most effective drugs for a given disease based on drug-disease coupled mechanisms remains a challenging problem. Results: In this paper, we propose a novel computational model for DDA prediction based on graph representation learning over a multi-biomolecular network (GRLMN). Specifically, we first constructed a large-scale molecular association network (MAN) by integrating the associations among drugs, diseases, proteins, miRNAs, and lncRNAs. Then, a graph embedding model was used to learn vector representations for all drugs and diseases in the MAN. Finally, the combined features were fed to a random forest (RF) model to predict new DDAs. The proposed model was evaluated on the SCMFDD-S data set using five-fold cross-validation. Experimental results showed that GRLMN achieved an area under the ROC curve (AUC) of 87.9%, outperforming all previous works in terms of both accuracy and AUC on the benchmark dataset. To further verify the performance of GRLMN, we carried out case studies for two common diseases. In the ranking of drugs predicted to be related to certain diseases (such as kidney disease and fever), 15 of the top 20 drugs have been experimentally confirmed. Conclusions: The experimental results show that our model performs well in the prediction of DDAs. GRLMN is an effective prioritization tool for screening reliable DDAs for follow-up studies on drug repositioning.
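
For readers unfamiliar with the AUC metric reported above, the following sketch computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name, labels, and scores are made up for illustration and have no connection to the GRLMN experiments.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a known association, 0 otherwise
    scores: predicted association scores, same order as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up predictions for five candidate drug-disease pairs.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.8]
print(round(roc_auc(labels, scores), 4))
```

In five-fold cross-validation, a value like this would be computed on each held-out fold and the five results averaged. The quadratic loop is fine for small examples; rank-based formulas are the usual choice for large datasets.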


Water ◽  
2020 ◽  
Vol 12 (9) ◽  
pp. 2398 ◽  
Author(s):  
Jackie O’Sullivan ◽  
Carmel Pollino ◽  
Peter Taylor ◽  
Ashmita Sengupta ◽  
Amit Parashar

Water resources are under growing pressure globally, and better basin planning is crucial to alleviate current and future water scarcity issues. Communicating the complex interconnections and needs of natural and human systems is a significant research challenge. Advances in cyberinfrastructure allow innovative approaches to basin planning, and the same technology can also facilitate better stakeholder engagement. The potential benefits of using digital basin planning platforms for stakeholder engagement are immense; yet, there is limited guidance on how best to use these platforms for more effective stakeholder engagement in water-related issues and projects. We detail our digital platform, Basin Futures, and highlight its potential uses for stakeholder engagement through an integrative framework across different assessment levels. Basin Futures is a web application and entry-level modelling tool that aims to support rapid and exploratory basin planning globally. As a cloud-based tool, it brings together high-performance computing and large-scale global datasets to make data analysis accessible and efficient. We explore the potential use of the tool through three case studies on agricultural development, transboundary water-sharing agreements and allocating water for environmental flows.


2021 ◽  
Vol 7 ◽  
pp. e606
Author(s):  
Daniel Silva Junior ◽  
Esther Pacitti ◽  
Aline Paes ◽  
Daniel de Oliveira

Scientific Workflows (SWfs) have revolutionized how scientists in various domains of science conduct their experiments. The management of SWfs is performed by complex tools that support workflow composition, monitoring, and execution, as well as the capture and storage of the data generated during execution. In some cases, they also provide components to ease the visualization and analysis of the generated data. During the workflow's composition phase, programs must be selected to perform the activities defined in the workflow specification. These programs often require additional parameters that adjust the program's behavior according to the experiment's goals. Consequently, workflows commonly have many parameters to be configured manually, in many cases more than one hundred. Choosing wrong parameter values can crash workflow executions or produce undesired results. As the execution of data- and compute-intensive workflows is commonly performed in a high-performance computing environment (e.g., a cluster, a supercomputer, or a public cloud), an unsuccessful execution constitutes a waste of time and resources. In this article, we present FReeP (Feature Recommender from Preferences), a parameter value recommendation method designed to suggest values for workflow parameters, taking into account past user preferences. FReeP is based on Machine Learning techniques, particularly Preference Learning. FReeP is composed of three algorithms: two aim at recommending the value for one parameter at a time, and the third makes recommendations for n parameters at once. The experimental results obtained with provenance data from two broadly used workflows showed FReeP's usefulness in recommending values for one parameter. Furthermore, the results indicate the potential of FReeP to recommend values for n parameters in scientific workflows.


Author(s):  
Wahidin Saputra ◽  
Elfi Tasrif

Entrepreneurship plays an important role in Indonesia's development because entrepreneurs help address problems of national economic development such as poverty and high unemployment. The Entrepreneurial Student Program is a program of the Directorate General of Higher Education of the Ministry of Education and Culture, implemented and developed by universities. Padang State University is one of the universities that has received assistance from the Directorate General of Higher Education. The Entrepreneurial Student Program Information System is web-based, so students and other users can access it anytime and anywhere to get information about the program. Proposers can submit their proposals directly through the system, which also facilitates data storage for the program's managers. The system is built with the Yii2 Framework, a component-based PHP framework that offers high performance for large-scale web application development. Keywords: Information Systems, Entrepreneurial Student Programs, Yii2 Framework


Author(s):  
Renan Souza ◽  
Marta Mattoso ◽  
Patrick Valduriez

Large-scale workflows that execute on High-Performance Computing machines need to be dynamically steered by users. This means that users analyze big data files, assess key performance indicators, fine-tune parameters, and evaluate the tuning impacts while the workflows generate multiple files, which is challenging. If one does not keep track of such interactions (called user steering actions), it may be impossible to understand the consequences of steering actions and to reproduce the results. This thesis proposes a generic approach to enable tracking user steering actions by characterizing, capturing, relating, and analyzing them by leveraging provenance data management concepts. Experiments with real users show that the approach enabled the understanding of the impact of steering actions while incurring negligible overhead.


2014 ◽  
Author(s):  
R Daniel Kortschak ◽  
David L Adelson

bíogo is a framework designed to ease development and maintenance of computationally intensive bioinformatics applications. The library is written in the Go programming language, a garbage-collected, strictly typed compiled language with built-in support for concurrent processing and performance comparable to C and Java. It provides a variety of data types and utility functions to facilitate manipulation and analysis of large-scale genomic and other biological data. bíogo uses a concise and expressive syntax, lowering the barriers to entry for researchers needing to process large data sets with custom analyses while retaining computational safety and ease of code review. We believe bíogo provides an excellent environment for training and research in computational biology because of its combination of strict typing, simple and expressive syntax, and high performance.

