workflow systems
Recently Published Documents

TOTAL DOCUMENTS: 375 (FIVE YEARS: 29)
H-INDEX: 33 (FIVE YEARS: 3)

2022 ◽ Vol 14 (1) ◽ pp. 1-27
Author(s): Khalid Belhajjame

Workflows have been adopted in several scientific fields as a tool for the specification and execution of scientific experiments. In addition to automating the execution of experiments, workflow systems often include capabilities to record provenance information, which captures, among other things, the data records used and generated not only by the workflow as a whole but also by its component modules. It is widely recognized that provenance information is useful for the interpretation, verification, and re-use of workflow results, which justifies sharing and publishing it among scientists. However, workflow executions in some branches of science manipulate sensitive datasets that contain information about individuals. To address this issue, we investigate in this article the problem of anonymizing the provenance of workflows. In doing so, we consider a popular class of workflows in which component modules use and generate collections of data records as a result of their invocation, as opposed to single data records. The solution we propose offers guarantees of confidentiality without compromising lineage information, which provides transparency as to the relationships between the data records used and generated by the workflow modules. We provide algorithmic solutions that show how the provenance of a single module and of an entire workflow can be anonymized, and we present the results of experiments conducted for their evaluation.
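The abstract does not spell out the anonymization algorithm, but the core idea it describes, hiding the contents of data records while keeping lineage intact, can be sketched. The snippet below is a hypothetical illustration, not the paper's method: it pseudonymizes sensitive attributes in a module's input and output record collections with a keyed hash, while leaving record identifiers and lineage pairs untouched. The provenance dictionary layout, `SENSITIVE_ATTRIBUTES`, and `SECRET_KEY` are all assumptions made for the example.

```python
# A minimal sketch (not the paper's algorithm): pseudonymize sensitive
# attribute values in a module's input/output record collections with a
# keyed hash, while keeping record identifiers untouched so that the
# lineage edges (input record -> output record) remain interpretable.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"        # assumption: a shared secret
SENSITIVE_ATTRIBUTES = {"name", "date_of_birth"}  # assumption: known sensitive fields

def pseudonymize(value: str) -> str:
    """Deterministic keyed hash, so equal values remain linkable after anonymization."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:12]

def anonymize_collection(records: list[dict]) -> list[dict]:
    """Anonymize a collection of data records used or generated by one module."""
    out = []
    for rec in records:
        anon = {"id": rec["id"]}                  # keep the identifier: lineage is preserved
        for attr, value in rec.items():
            if attr == "id":
                continue
            anon[attr] = pseudonymize(str(value)) if attr in SENSITIVE_ATTRIBUTES else value
        out.append(anon)
    return out

def anonymize_module_provenance(prov: dict) -> dict:
    """Anonymize the provenance of a single module invocation.

    `prov` is assumed to look like:
    {"module": "align", "used": [...records...], "generated": [...records...],
     "lineage": [("in-3", "out-7"), ...]}  # record-id pairs, left untouched
    """
    return {
        "module": prov["module"],
        "used": anonymize_collection(prov["used"]),
        "generated": anonymize_collection(prov["generated"]),
        "lineage": prov["lineage"],               # unchanged: derivation relationships stay visible
    }
```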


2021
Author(s): Jung Wook Yang ◽ Dae Hyun Song ◽ Hyo Jung An ◽ Sat Byul Seo

Abstract Identifying the lung carcinoma subtype in small biopsy specimens is an important part of determining a suitable treatment plan but is often challenging without the help of special and/or immunohistochemical stains. Pathology image analysis that tackles this issue would be helpful for diagnoses and subtyping of lung carcinoma. In this study, we developed AI models to classify multinomial patterns of lung carcinoma (adenocarcinoma, squamous cell carcinoma, small cell carcinoma, large cell neuroendocrine carcinoma) and non-neoplastic lung tissue based on convolutional neural networks (CNN or ConvNet). Four CNNs that were pre-trained using transfer learning and one CNN built from scratch were used to classify patch images from pathology whole-slide images (WSIs). We evaluated the diagnostic performance of each model in the test sets. The Xception model achieved the highest performance among pre-trained CNNs with an accuracy of 0.86 and an area under the curve (AUC) of 0.97. The built from scratch CNN model obtained an accuracy of 0.92 and an AUC ranging from 0.99 to 1.00 for subtyping lung carcinoma tasks. These results demonstrate how promising CNN models are for developing improved diagnostic workflow systems for diagnosis and subtyping of lung carcinoma. Of particular note is the fact that the built from scratch CNN described in this paper achieves prompt and consistent results so has the potential to be applied in working hospitals for pathological diagnoses.
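For readers unfamiliar with the transfer-learning setup the abstract describes, the sketch below shows one plausible configuration of the Xception-based classifier in Keras. It is illustrative only: the patch size, classification head, optimizer, and training details are assumptions, not the architecture or hyperparameters reported in the paper.

```python
# Minimal transfer-learning sketch in Keras, assuming 299x299 RGB patches and
# five classes (ADC, SqCC, SCLC, LCNEC, non-neoplastic). The head architecture,
# optimizer, and hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 5

base = tf.keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False                        # freeze the ImageNet features first

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.AUC(name="auc")])

# train_ds / val_ds would be tf.data.Dataset objects yielding (patch, one-hot label):
# model.fit(train_ds, validation_data=val_ds, epochs=20)
```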


2021
Author(s): Spiros Koulouzis ◽ Yifang Shi ◽ Yuandou Wan ◽ Riccardo Bianchi ◽ Daniel Kissling ◽ ...

Airborne Laser Scanning (ALS) data derived from Light Detection And Ranging (LiDAR) technology allow the construction of Essential Biodiversity Variables (EBVs) of ecosystem structure with high resolution at landscape, national and regional scales. Researchers nowadays often process such data and rapidly prototype using scripting languages like R or Python, and they share their experiments as scripts or, more recently, via notebook environments such as Jupyter. To scale experiments to large data volumes, extra data sources, or new models, researchers often employ cloud infrastructures to enhance notebooks (e.g. JupyterHub) or execute the experiments as a distributed workflow. In many cases, a researcher has to encapsulate subsets of the code (namely, cells in Jupyter) from the notebook as components to be included in the workflow. However, it is usually time-consuming and a burden for the researcher to encapsulate those components against a workflow system's specific interface, and the Findability, Accessibility, Interoperability and Reusability (FAIR) of those components is often limited. We aim to enable the public cloud processing of massive amounts of ALS data across countries and regions and to make the retrieval and uptake of such EBV data products of ecosystem structure easily available to a wide scientific community and stakeholders.

We propose and develop a tool called FAIR-Cells, which can be integrated into the JupyterLab environment as an extension, to help scientists and researchers improve the FAIRness of their code. It can encapsulate user-selected cells of code as standardized RESTful API services, and it allows users to containerize such Jupyter code cells and publish them as reusable components via community repositories.

We demonstrate the features of FAIR-Cells using an application from the ecology domain. Ecologists currently process various point cloud datasets derived from LiDAR to extract metrics that capture the vertical and horizontal structure of vegetation. A new open-source software package called 'Laserchicken' allows the processing of country-wide LiDAR datasets in a local environment (e.g. the Dutch national ICT infrastructure called SURF). However, users have to run the Laserchicken application as a whole to process the LiDAR data, and the capacity of the given infrastructure limits the volume of data that can be handled. In this work, we first demonstrate how a user can apply the FAIR-Cells extension to interactively create RESTful services for components of the Laserchicken software in a Jupyter environment, automate the encapsulation of those services as Docker containers, and publish the services in a community catalogue (e.g. LifeWatch) via its API (based on GeoNetwork). We then demonstrate how those containers can be assembled into a workflow (e.g. using the Common Workflow Language) and deployed on a cloud environment (offered by the EOSC early adopter program for ENVRI-FAIR) to process a much bigger dataset than is possible in a local environment. The demonstration results suggest that the technical roadmap of our approach can achieve FAIRness and good parallelism over large, distributed data volumes when executing Jupyter-based code.
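The abstract describes encapsulating notebook cells as standardized RESTful services. The sketch below approximates, by hand and in plain Flask, the kind of wrapper such a tool might generate around a cell's code; it is not the FAIR-Cells implementation or its generated output. The endpoint name, payload schema, and the `compute_metric` body (a toy stand-in for a Laserchicken-style point-cloud metric) are assumptions.

```python
# Illustrative only: a hand-written Flask wrapper approximating a notebook cell
# exposed as a RESTful service. Endpoint, payload schema, and the metric are
# hypothetical; a real tool would generate this and a matching Docker image.
from flask import Flask, jsonify, request

app = Flask(__name__)

def compute_metric(points: list[list[float]]) -> float:
    """Stand-in for the cell's code, e.g. a simple vegetation-height-range metric."""
    zs = [p[2] for p in points]                   # z coordinate of each point
    return max(zs) - min(zs) if zs else 0.0

@app.route("/run", methods=["POST"])
def run_cell():
    payload = request.get_json(force=True)
    result = compute_metric(payload.get("points", []))
    return jsonify({"metric": result})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)            # container entry point
```

Packaging this service in a Docker image and describing the container as a workflow step (for example in the Common Workflow Language) is then what allows the cells to be chained and deployed on a cloud environment, as outlined in the abstract.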


Author(s): Hatem Elshazly ◽ Francesc Lordan ◽ Jorge Ejarque ◽ Rosa M. Badia

Task-based programming models offer a flexible way to express the unstructured parallelism patterns of today's complex applications. This expressive capability is required to achieve the maximum possible performance for applications executed on distributed platforms. In current task-based workflows, tasks are launched for execution when their data dependencies are satisfied. However, even if the data a task depends on has already been produced, the execution of that task is delayed until its predecessor tasks have completely finished their execution. As a consequence of this approach to releasing dependencies, the amount of parallelism inherent in applications is limited and opportunities for performance improvement are wasted. To mitigate this limitation, we propose an eager approach for releasing data dependencies. Following this approach, the execution of tasks is not delayed until their predecessor tasks completely finish; instead, tasks are launched for execution as soon as their data requirements are available. Hence, more parallelism is exposed and applications can achieve higher levels of performance by overlapping the execution of tasks. Towards this goal, in this paper we propose two changes to task-based workflow systems. First, the dependency relationships of tasks are specified not only in terms of predecessor and successor tasks but also in terms of the data that caused these dependencies. Second, the release of a dependency is triggered as soon as a predecessor task generates the corresponding output data, instead of waiting until the end of the predecessor's execution to release all of its dependencies. We realize this proposal in PyCOMPSs, a task-based programming model for parallelizing Python applications. Our experiments show that the eager approach for releasing dependencies achieves more than a 50% improvement in total execution time compared to the default approach.
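To make the scheduling difference concrete, the toy sketch below tracks dependencies per data item rather than per predecessor task, so a successor becomes ready the moment the specific data it waits for exists, even while the producing task is still running. This is an illustration of the eager-release idea in plain Python, not the PyCOMPSs runtime implementation.

```python
# Toy sketch of eager dependency release (not the PyCOMPSs implementation):
# dependencies are keyed on data items, and a successor is released as soon as
# all the data it needs exists, even if its predecessors are still running.
from collections import defaultdict

class EagerScheduler:
    def __init__(self):
        self.waiting = {}                      # task -> set of data items still missing
        self.consumers = defaultdict(list)     # data item -> tasks waiting on it
        self.ready = []                        # tasks whose inputs are all available

    def submit(self, task, needs):
        missing = set(needs)
        if not missing:
            self.ready.append(task)
            return
        self.waiting[task] = missing
        for item in missing:
            self.consumers[item].append(task)

    def data_produced(self, item):
        """Called the moment a predecessor writes `item`,
        not when the predecessor finishes all of its outputs."""
        for task in self.consumers.pop(item, []):
            pending = self.waiting[task]
            pending.discard(item)
            if not pending:                    # all inputs present: release the task
                del self.waiting[task]
                self.ready.append(task)

# Example: t2 needs only 'a', so it is released as soon as 'a' is produced,
# even though the producer will still generate 'b' afterwards.
sched = EagerScheduler()
sched.submit("t2", needs=["a"])
sched.submit("t3", needs=["a", "b"])
sched.data_produced("a")        # -> t2 becomes ready now
sched.data_produced("b")        # -> t3 becomes ready
print(sched.ready)              # ['t2', 't3']
```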


2021 ◽ Vol 0 (0)
Author(s): Marcel Friedrichs

Abstract Data integration plays a vital role in scientific research. In biomedical research, the omics fields, such as proteomics and pharmacogenomics as well as newer fields like foodomics, have shown the need for ever-larger datasets. As research projects draw on multiple data sources, mapping between these sources becomes necessary. Workflow systems and integration tools therefore need to process large amounts of heterogeneous data formats, check for data source updates, and find suitable mapping methods to cross-reference entities from different databases. This article presents BioDWH2, an open-source, graph-based data warehouse and mapping tool capable of helping researchers with these issues. A workspace-centered approach allows project-specific data source selections, and Neo4j or GraphQL server tools enable quick access to the database for analysis. The BioDWH2 tools are available to the scientific community at https://github.com/BioDWH2.
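As an illustration of the "quick access for analysis" the abstract mentions, the snippet below queries a locally running Neo4j instance with the official Python driver. The node label, relationship type, and property names in the Cypher query are hypothetical; the actual graph schema produced by BioDWH2 depends on the configured data sources.

```python
# A minimal access sketch, assuming a BioDWH2-backed Neo4j server is running
# locally. Labels, relationship types, and properties below are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (d:Drug)-[:TARGETS]->(p:Protein)
WHERE p.name = $protein
RETURN d.name AS drug
"""

with driver.session() as session:
    for record in session.run(query, protein="CYP2D6"):
        print(record["drug"])

driver.close()
```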


GigaScience ◽ 2021 ◽ Vol 10 (1)
Author(s): Taylor Reiter ◽ Phillip T Brooks† ◽ Luiz Irber† ◽ Shannon E K Joslin† ◽ Charles M Reid† ◽ ...

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
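The abstract does not name a specific system, so the sketch below only illustrates the core behaviour that data-centric workflow systems (Snakemake and Nextflow are well-known examples) automate: a step re-runs only when its output is missing or older than its inputs, which is what keeps incremental development of multi-step analyses tractable. The file names and shell commands are hypothetical.

```python
# Toy illustration of data-centric execution: rebuild an output only when it is
# missing or stale relative to its inputs. File names and commands are examples.
import os
import subprocess

def needs_update(output: str, inputs: list[str]) -> bool:
    if not os.path.exists(output):
        return True
    out_mtime = os.path.getmtime(output)
    return any(os.path.getmtime(i) > out_mtime for i in inputs)

def rule(output: str, inputs: list[str], command: str) -> None:
    if needs_update(output, inputs):
        print(f"running: {command}")
        subprocess.run(command, shell=True, check=True)
    else:
        print(f"up to date: {output}")

# e.g. trim reads, then map them; the second step only re-runs if trimming changed
rule("trimmed.fq", ["raw.fq"], "fastp -i raw.fq -o trimmed.fq")
rule("aligned.bam", ["trimmed.fq", "ref.fa"],
     "minimap2 -a ref.fa trimmed.fq | samtools sort -o aligned.bam")
```

Real workflow systems add to this the resource management, software environment handling, and conditional execution described in the abstract, which is why the authors recommend them over ad hoc scripting for large-scale analyses.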


2020 ◽ Vol 16 (4) ◽ pp. 427-449
Author(s): Kan Ngamakeur ◽ Sira Yongchareon

Purpose: The paper aims to study the realization requirements for the flexible enactment of artifact-centric business processes in a dynamic, collaborative environment and to develop a workflow execution framework that can effectively address those requirements.
Design/methodology/approach: This study proposes a framework and a contract-based, event-driven architecture design and implementation that can directly realize collaborative artifact-centric business processes in a service-oriented architecture (SOA) without any model conversion.
Findings: The results show that the approach is feasible and offers several key benefits over using existing workflow systems to run artifact-centric processes.
Originality/value: Most existing approaches require an artifact-centric model to be transformed into an executable workflow language to run on existing workflow management systems. This study argues that such model conversion can incur losses of information and affect the traceability and monitorability of workflows, especially in an SOA where a workflow can span across multiple business entities.
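The abstract does not detail the framework's internals, but the essence of direct, event-driven realization of an artifact-centric process can be sketched as an artifact whose lifecycle advances in response to service events, with no translation into a separate workflow language. The artifact type, states, and event names below are hypothetical, invented only for illustration.

```python
# Illustrative only (not the paper's framework): an artifact lifecycle realized
# directly as an event-driven state machine. States and events are hypothetical.
class OrderArtifact:
    TRANSITIONS = {
        ("created",   "quote_accepted"):   "confirmed",
        ("confirmed", "goods_shipped"):    "shipped",
        ("shipped",   "payment_received"): "completed",
    }

    def __init__(self, artifact_id: str):
        self.artifact_id = artifact_id
        self.state = "created"

    def on_event(self, event: str) -> None:
        """Advance the artifact's lifecycle when a collaborating service emits an event."""
        next_state = self.TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"event '{event}' not allowed in state '{self.state}'")
        print(f"{self.artifact_id}: {self.state} -> {next_state} on {event}")
        self.state = next_state

order = OrderArtifact("order-42")
for evt in ("quote_accepted", "goods_shipped", "payment_received"):
    order.on_event(evt)     # created -> confirmed -> shipped -> completed
```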

