workflow systems
Recently Published Documents

TOTAL DOCUMENTS: 375 (FIVE YEARS: 29)
H-INDEX: 33 (FIVE YEARS: 3)

2022 ◽ Vol 14 (1) ◽ pp. 1-27
Author(s): Khalid Belhajjame

Workflows have been adopted in several scientific fields as a tool for the specification and execution of scientific experiments. In addition to automating the execution of experiments, workflow systems often include capabilities to record provenance information, which captures, among other things, the data records used and generated not only by the workflow as a whole but also by its component modules. It is widely recognized that provenance information is useful for the interpretation, verification, and re-use of workflow results, which justifies sharing and publishing it among scientists. However, workflow executions in some branches of science manipulate sensitive datasets that contain information about individuals. To address this issue, we investigate in this article the problem of anonymizing the provenance of workflows. In doing so, we consider a popular class of workflows in which component modules use and generate collections of data records as a result of their invocation, as opposed to single data records. The solution we propose offers guarantees of confidentiality without compromising lineage information, which provides transparency as to the relationships between the data records used and generated by the workflow modules. We provide algorithmic solutions that show how the provenance of a single module and of an entire workflow can be anonymized, and we present the results of experiments conducted for their evaluation.
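The abstract does not spell out the anonymization algorithm, but the core idea it describes, hiding the contents of data records while keeping lineage intact, can be sketched. The snippet below is a hypothetical illustration, not the paper's method: it pseudonymizes sensitive attributes in a module's input and output record collections with a keyed hash, while leaving record identifiers and lineage pairs untouched. The provenance dictionary layout, `SENSITIVE_ATTRIBUTES`, and `SECRET_KEY` are all assumptions made for the example.

```python
# A minimal sketch (not the paper's algorithm): pseudonymize sensitive
# attribute values in a module's input/output record collections with a
# keyed hash, while keeping record identifiers untouched so that the
# lineage edges (input record -> output record) remain interpretable.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"        # assumption: a shared secret
SENSITIVE_ATTRIBUTES = {"name", "date_of_birth"}  # assumption: known sensitive fields

def pseudonymize(value: str) -> str:
    """Deterministic keyed hash, so equal values remain linkable after anonymization."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

def anonymize_collection(records: list[dict]) -> list[dict]:
    """Anonymize a collection of data records used or generated by one module."""
    out = []
    for rec in records:
        anon = {"id": rec["id"]}                  # keep the identifier: lineage is preserved
        for attr, value in rec.items():
            if attr == "id":
                continue
            anon[attr] = pseudonymize(str(value)) if attr in SENSITIVE_ATTRIBUTES else value
        out.append(anon)
    return out

def anonymize_module_provenance(prov: dict) -> dict:
    """Anonymize the provenance of a single module invocation.

    `prov` is assumed to look like:
    {"module": "align", "used": [...records...], "generated": [...records...],
     "lineage": [("in-3", "out-7"), ...]}  # record-id pairs, left untouched
    """
    return {
        "module": prov["module"],
        "used": anonymize_collection(prov["used"]),
        "generated": anonymize_collection(prov["generated"]),
        "lineage": prov["lineage"],               # unchanged: derivation relationships stay visible
    }
```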


2021
Author(s): Jung Wook Yang ◽ Dae Hyun Song ◽ Hyo Jung An ◽ Sat Byul Seo

Abstract Identifying the lung carcinoma subtype in small biopsy specimens is an important part of determining a suitable treatment plan but is often challenging without the help of special and/or immunohistochemical stains. Pathology image analysis that tackles this issue would be helpful for diagnoses and subtyping of lung carcinoma. In this study, we developed AI models to classify multinomial patterns of lung carcinoma (adenocarcinoma, squamous cell carcinoma, small cell carcinoma, large cell neuroendocrine carcinoma) and non-neoplastic lung tissue based on convolutional neural networks (CNN or ConvNet). Four CNNs that were pre-trained using transfer learning and one CNN built from scratch were used to classify patch images from pathology whole-slide images (WSIs). We evaluated the diagnostic performance of each model in the test sets. The Xception model achieved the highest performance among pre-trained CNNs with an accuracy of 0.86 and an area under the curve (AUC) of 0.97. The built from scratch CNN model obtained an accuracy of 0.92 and an AUC ranging from 0.99 to 1.00 for subtyping lung carcinoma tasks. These results demonstrate how promising CNN models are for developing improved diagnostic workflow systems for diagnosis and subtyping of lung carcinoma. Of particular note is the fact that the built from scratch CNN described in this paper achieves prompt and consistent results so has the potential to be applied in working hospitals for pathological diagnoses.
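For readers unfamiliar with the transfer-learning setup the abstract describes, the sketch below shows one plausible configuration of the Xception-based classifier in Keras. It is illustrative only: the patch size, classification head, optimizer, and training details are assumptions, not the architecture or hyperparameters reported in the paper.

```python
# Minimal transfer-learning sketch in Keras, assuming 299x299 RGB patches and
# five classes (ADC, SqCC, SCLC, LCNEC, non-neoplastic). The head architecture,
# optimizer, and hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5

base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False                        # freeze the ImageNet features first

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

# train_ds / val_ds would be tf.data.Dataset objects yielding (patch, one-hot label):
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```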


2021
Author(s): Spiros Koulouzis ◽ Yifang Shi ◽ Yuandou Wan ◽ Riccardo Bianchi ◽ Daniel Kissling ◽ ...

Airborne Laser Scanning (ALS) data derived from Light Detection And Ranging (LiDAR) technology allow the construction of Essential Biodiversity Variables (EBVs) of ecosystem structure with high resolution at landscape, national and regional scales. Researchers nowadays often process such data and rapidly prototype using scripting languages like R or Python, and they share their experiments as scripts or, more recently, via notebook environments such as Jupyter. To scale experiments to large data volumes, extra data sources, or new models, researchers often employ cloud infrastructures to enhance notebooks (e.g. JupyterHub) or execute the experiments as a distributed workflow. In many cases, a researcher has to encapsulate subsets of the code (namely, cells in Jupyter) from the notebook as components to be included in the workflow. However, it is usually time-consuming and a burden for the researcher to encapsulate those components against a workflow system's specific interface, and the Findability, Accessibility, Interoperability and Reusability (FAIR) of those components is often limited. We aim to enable the public cloud processing of massive amounts of ALS data across countries and regions and to make the retrieval and uptake of such EBV data products of ecosystem structure easily available to a wide scientific community and stakeholders.

We propose and develop a tool called FAIR-Cells, which can be integrated into the JupyterLab environment as an extension, to help scientists and researchers improve the FAIRness of their code. It can encapsulate user-selected cells of code as standardized RESTful API services, and it allows users to containerize such Jupyter code cells and publish them as reusable components via community repositories.

We demonstrate the features of FAIR-Cells using an application from the ecology domain. Ecologists currently process various point cloud datasets derived from LiDAR to extract metrics that capture the vertical and horizontal structure of vegetation. A new open-source software package called 'Laserchicken' allows the processing of country-wide LiDAR datasets in a local environment (e.g. the Dutch national ICT infrastructure called SURF). However, users have to run the Laserchicken application as a whole to process the LiDAR data, and the capacity of the given infrastructure limits the volume of data that can be handled. In this work, we first demonstrate how a user can apply the FAIR-Cells extension to interactively create RESTful services for components of the Laserchicken software in a Jupyter environment, automate the encapsulation of those services as Docker containers, and publish the services in a community catalogue (e.g. LifeWatch) via its API (based on GeoNetwork). We then demonstrate how those containers can be assembled into a workflow (e.g. using the Common Workflow Language) and deployed on a cloud environment (offered by the EOSC early adopter program for ENVRI-FAIR) to process a much bigger dataset than is possible in a local environment. The demonstration results suggest that the technical roadmap of our approach can achieve FAIRness and good parallelism over large, distributed data volumes when executing Jupyter-based code.
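The abstract describes encapsulating notebook cells as standardized RESTful services. The sketch below approximates, by hand and in plain Flask, the kind of wrapper such a tool might generate around a cell's code; it is not the FAIR-Cells implementation or its generated output. The endpoint name, payload schema, and the `compute_metric` body (a toy stand-in for a Laserchicken-style point-cloud metric) are assumptions.

```python
# Illustrative only: a hand-written Flask wrapper approximating a notebook cell
# exposed as a RESTful service. Endpoint, payload schema, and the metric are
# hypothetical; a real tool would generate this and a matching Docker image.
from flask import Flask, jsonify, request

app = Flask(__name__)

def compute_metric(points: list[list[float]]) -> float:
    """Stand-in for the cell's code, e.g. a simple vegetation-height-range metric."""
    zs = [p[2] for p in points]                   # z coordinate of each point
    return max(zs) - min(zs) if zs else 0.0

@app.route("/run", methods=["POST"])
def run_cell():
    payload = request.get_json(force=True)
    result = compute_metric(payload.get("points", []))
    return jsonify({"metric": result})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)            # container entry point
```

Packaging this service in a Docker image and describing the container as a workflow step (for example in the Common Workflow Language) is then what allows the cells to be chained and deployed on a cloud environment, as outlined in the abstract.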


Author(s): Hatem Elshazly ◽ Francesc Lordan ◽ Jorge Ejarque ◽ Rosa M. Badia

Task-based programming models offer a flexible way to express the unstructured parallelism patterns of today's complex applications. This expressive capability is required to achieve the maximum possible performance for applications executed on distributed platforms. In current task-based workflows, tasks are launched for execution when their data dependencies are satisfied. However, even if the data a task depends on has already been produced, the execution of that task is delayed until its predecessor tasks have completely finished their execution. As a consequence of this approach to releasing dependencies, the amount of parallelism inherent in applications is limited and opportunities for performance improvement are wasted. To mitigate this limitation, we propose an eager approach for releasing data dependencies. Following this approach, the execution of tasks is not delayed until their predecessor tasks completely finish; instead, tasks are launched for execution as soon as their data requirements are available. Hence, more parallelism is exposed and applications can achieve higher levels of performance by overlapping the execution of tasks. Towards this goal, in this paper we propose two changes to task-based workflow systems. First, the dependency relationships of tasks are specified not only in terms of predecessor and successor tasks but also in terms of the data that caused these dependencies. Second, the release of a dependency is triggered as soon as a predecessor task generates the corresponding output data, instead of waiting until the end of the predecessor's execution to release all of its dependencies. We realize this proposal in PyCOMPSs, a task-based programming model for parallelizing Python applications. Our experiments show that the eager approach for releasing dependencies achieves more than a 50% improvement in total execution time compared to the default approach.
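To make the scheduling difference concrete, the toy sketch below tracks dependencies per data item rather than per predecessor task, so a successor becomes ready the moment the specific data it waits for exists, even while the producing task is still running. This is an illustration of the eager-release idea in plain Python, not the PyCOMPSs runtime implementation.

```python
# Toy sketch of eager dependency release (not the PyCOMPSs implementation):
# dependencies are keyed on data items, and a successor is released as soon as
# all the data it needs exists, even if its predecessors are still running.
from collections import defaultdict

class EagerScheduler:
    def __init__(self):
        self.waiting = {}                      # task -> set of data items still missing
        self.consumers = defaultdict(list)     # data item -> tasks waiting on it
        self.ready = []                        # tasks whose inputs are all available

    def submit(self, task, needs):
        missing = set(needs)
        if not missing:
            self.ready.append(task)
            return
        self.waiting[task] = missing
        for item in missing:
            self.consumers[item].append(task)

    def data_produced(self, item):
        """Called the moment a predecessor writes `item`,
        not when the predecessor finishes all of its outputs."""
        for task in self.consumers.pop(item, []):
            pending = self.waiting[task]
            pending.discard(item)
            if not pending:                    # all inputs present: release the task
                del self.waiting[task]
                self.ready.append(task)

# Example: t2 needs only 'a', so it is released as soon as 'a' is produced,
# even though the producer will still generate 'b' afterwards.
sched = EagerScheduler()
sched.submit("t2", needs=["a"])
sched.submit("t3", needs=["a", "b"])
sched.data_produced("a")        # -> t2 becomes ready now
sched.data_produced("b")        # -> t3 becomes ready
print(sched.ready)              # ['t2', 't3']
```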


2021 ◽ Vol 0 (0)
Author(s): Marcel Friedrichs

Abstract Data integration plays a vital role in scientific research. In biomedical research, the omics fields, such as proteomics and pharmacogenomics as well as newer fields like foodomics, have shown the need for ever-larger datasets. As research projects draw on multiple data sources, mapping between these sources becomes necessary. Workflow systems and integration tools therefore need to process large amounts of heterogeneous data formats, check for data source updates, and find suitable mapping methods to cross-reference entities from different databases. This article presents BioDWH2, an open-source, graph-based data warehouse and mapping tool capable of helping researchers with these issues. A workspace-centered approach allows project-specific data source selections, and Neo4j or GraphQL server tools enable quick access to the database for analysis. The BioDWH2 tools are available to the scientific community at https://github.com/BioDWH2.
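As an illustration of the "quick access for analysis" the abstract mentions, the snippet below queries a locally running Neo4j instance with the official Python driver. The node label, relationship type, and property names in the Cypher query are hypothetical; the actual graph schema produced by BioDWH2 depends on the configured data sources.

```python
# A minimal access sketch, assuming a BioDWH2-backed Neo4j server is running
# locally. Labels, relationship types, and properties below are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (d:Drug)-[:TARGETS]->(p:Protein)
WHERE p.name = $protein
RETURN d.name AS drug
"""

with driver.session() as session:
    for record in session.run(query, protein="CYP2D6"):
        print(record["drug"])

driver.close()
```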


GigaScience ◽ 2021 ◽ Vol 10 (1)
Author(s): Taylor Reiter ◽ Phillip T Brooks† ◽ Luiz Irber† ◽ Shannon E K Joslin† ◽ Charles M Reid† ◽ ...

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
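The abstract does not name a specific system, so the sketch below only illustrates the core behaviour that data-centric workflow systems (Snakemake and Nextflow are well-known examples) automate: a step re-runs only when its output is missing or older than its inputs, which is what keeps incremental development of multi-step analyses tractable. The file names and shell commands are hypothetical.

```python
# Toy illustration of data-centric execution: rebuild an output only when it is
# missing or stale relative to its inputs. File names and commands are examples.
import os
import subprocess

def needs_update(output: str, inputs: list[str]) -> bool:
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

def rule(output: str, inputs: list[str], command: str) -> None:
    if needs_update(output, inputs):
        print(f"running: {command}")
        subprocess.run(command, shell=True, check=True)
    else:
        print(f"up to date: {output}")

# e.g. trim reads, then map them; the second step only re-runs if trimming changed
rule("trimmed.fq", ["raw.fq"], "fastp -i raw.fq -o trimmed.fq")
rule("aligned.bam", ["trimmed.fq", "ref.fa"],
     "minimap2 -a ref.fa trimmed.fq | samtools sort -o aligned.bam")
```

Real workflow systems add to this the resource management, software environment handling, and conditional execution described in the abstract, which is why the authors recommend them over ad hoc scripting for large-scale analyses.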


2020 ◽ Vol 16 (4) ◽ pp. 427-449
Author(s): Kan Ngamakeur ◽ Sira Yongchareon

Purpose: The paper aims to study the realization requirements for the flexible enactment of artifact-centric business processes in a dynamic, collaborative environment and to develop a workflow execution framework that can effectively address those requirements.
Design/methodology/approach: This study proposes a framework and a contract-based, event-driven architecture design and implementation that can directly realize collaborative artifact-centric business processes in a service-oriented architecture (SOA) without any model conversion.
Findings: The results show that the approach is feasible and offers several key benefits over using existing workflow systems to run artifact-centric processes.
Originality/value: Most existing approaches require an artifact-centric model to be transformed into an executable workflow language to run on existing workflow management systems. This study argues that such model conversion can incur losses of information and affect the traceability and monitorability of workflows, especially in an SOA where a workflow can span across multiple business entities.
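The abstract does not detail the framework's internals, but the essence of direct, event-driven realization of an artifact-centric process can be sketched as an artifact whose lifecycle advances in response to service events, with no translation into a separate workflow language. The artifact type, states, and event names below are hypothetical, invented only for illustration.

```python
# Illustrative only (not the paper's framework): an artifact lifecycle realized
# directly as an event-driven state machine. States and events are hypothetical.
class OrderArtifact:
    TRANSITIONS = {
        ("created",   "quote_accepted"):   "confirmed",
        ("confirmed", "goods_shipped"):    "shipped",
        ("shipped",   "payment_received"): "completed",
    }

    def __init__(self, artifact_id: str):
        self.artifact_id = artifact_id
        self.state = "created"

    def on_event(self, event: str) -> None:
        """Advance the artifact's lifecycle when a collaborating service emits an event."""
        next_state = self.TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"event '{event}' not allowed in state '{self.state}'")
        print(f"{self.artifact_id}: {self.state} -> {next_state} on {event}")
        self.state = next_state

order = OrderArtifact("order-42")
for evt in ("quote_accepted", "goods_shipped", "payment_received"):
    order.on_event(evt)     # created -> confirmed -> shipped -> completed
```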

