data lineage
Recently Published Documents


TOTAL DOCUMENTS

47
(FIVE YEARS 15)

H-INDEX

7
(FIVE YEARS 1)

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Paul Billing Ross ◽  
Jina Song ◽  
Philip S. Tsao ◽  
Cuiping Pan

Abstract: Biomedical studies have grown in size and yield large quantities of data, yet efficient data processing remains a challenge. Here we present Trellis, a cloud-based data and task management framework that completely automates the process from data ingestion to result presentation, while tracking data lineage, facilitating information queries, and supporting fault tolerance and scalability. Using a graph database to coordinate the state of the data processing workflows and a scalable microservice architecture to perform bioinformatics tasks, Trellis has enabled efficient variant calling on 100,000 human genomes collected in the VA Million Veteran Program.
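The coordination pattern described above (data objects and tasks stored as nodes in a graph database, so that lineage queries become graph traversals) can be sketched in a few lines. The snippet below is a minimal illustration using the Neo4j Python driver; the node labels, URIs, and credentials are hypothetical and do not reflect Trellis's actual schema.

```python
# Hypothetical sketch: recording workflow state and lineage in a graph
# database, in the spirit of Trellis. Node labels and properties are
# illustrative, not Trellis's actual schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def register_output(tx, input_uri, task_name, output_uri):
    # Each data object is a node; tasks link inputs to outputs,
    # so lineage queries become simple graph traversals.
    tx.run(
        "MERGE (i:Blob {uri: $input_uri}) "
        "MERGE (o:Blob {uri: $output_uri}) "
        "MERGE (i)-[:WAS_INPUT_TO]->(t:Task {name: $task_name, status: 'done'}) "
        "MERGE (t)-[:GENERATED]->(o)",
        input_uri=input_uri, task_name=task_name, output_uri=output_uri,
    )

with driver.session() as session:
    session.execute_write(register_output,
                          "gs://bucket/sample.fastq", "variant-calling",
                          "gs://bucket/sample.vcf")
driver.close()
```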


Author(s):  
Alexander Schoenenwald ◽  
Simon Kern ◽  
Josef Viehhauser ◽  
Johannes Schildgen

Abstract: Metadata management constitutes a key prerequisite for enterprises as they engage in data analytics and governance. Today, however, the context of data is often only manually documented by subject matter experts, and lacks completeness and reliability due to the complex nature of data pipelines. Thus, collecting data lineage (describing the origin, structure, and dependencies of data) in an automated fashion increases the quality of the provided metadata and reduces manual effort, making it critical for the development and operation of data pipelines. In our practice report, we propose an end-to-end solution that digests lineage via (Py)Spark execution plans. We build upon the open-source component Spline, allowing us to reliably consume lineage metadata and identify interdependencies. We map the digested data into an expandable data model, enabling us to extract graph structures for both coarse- and fine-grained data lineage. Lastly, our solution visualizes the extracted data lineage via a modern web app, and integrates with BMW Group's soon-to-be open-sourced Cloud Data Hub.
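As a rough sketch of the setup the report describes, the following PySpark fragment attaches Spline's query-execution listener so that execution plans are harvested as lineage events. The listener class and producer URL follow Spline's public documentation; the file paths and column names are illustrative, and the Spline agent JAR is assumed to be on the Spark classpath.

```python
# Sketch: a PySpark job with the Spline agent attached, so that execution
# plans are emitted as lineage events (the agent bundle JAR must be on the
# classpath, e.g. via --packages). Paths and columns are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lineage-demo")
    # Spline hooks into Spark via a query-execution listener.
    .config("spark.sql.queryExecutionListeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
    .config("spark.spline.producer.url", "http://localhost:8080/producer")
    .getOrCreate()
)

df = spark.read.csv("input/orders.csv", header=True)
result = df.groupBy("customer_id").count()

# The execution plan that lineage is derived from can also be inspected:
result.explain(extended=True)

result.write.mode("overwrite").parquet("output/orders_by_customer")
```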


2021 ◽  
pp. 5-19
Author(s):  
Jens Freche ◽  
Milan den Heijer ◽  
Bastian Wormuth

2021 ◽  
Vol 43 ◽  
Author(s):  
Júlia Camargos da Costa ◽  
Everson Reis Carvalho ◽  
Izabel Costa Silva Neta ◽  
Milena Christy Santos ◽  
Luciano Dias Cabral Neto ◽  
...  

Abstract: This study aimed to evaluate the effect of the genetic composition and the arrangement between female and male parents on tolerance to delayed drying of maize seeds, assessing physiological quality and enzyme expression. Ears were harvested close to physiological maturity (around 35% moisture), and the genotypes were identified as line 1 (L1), line 2 (L2), the hybrid (HB, ♀L1 × ♂L2), and the reciprocal hybrid (HR, ♀L2 × ♂L1). Physiological quality was assessed in a completely randomized design in a 4 × 6 × 2 factorial arrangement: four genotypes, six delay periods before artificial drying (10, 18, 24, 28, 32, and 40 hours), and two drying temperatures (42 and 48 °C). Enzyme expression was assessed in a completely randomized design in a 4 × 3 factorial arrangement: four genotypes and three delay periods before artificial drying (10, 24, and 40 hours) at 48 °C. The data were subjected to analysis of variance (F test, p < 0.05), Tukey's test (p < 0.05), and polynomial regression analysis. The parental arrangement affects seed tolerance to drying delay; therefore, susceptible lines should not be used as female parents. Seeds of the line most susceptible to drying delay (L2) exhibit lower expression of α-amylase (α-AM).
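For illustration, the factorial analysis described in the abstract (ANOVA F test followed by Tukey's test) could be reproduced along these lines with statsmodels; the input file and the response column "germination" are hypothetical stand-ins for the study's measurements.

```python
# Sketch of the described analysis: a factorial ANOVA (F test) followed by
# Tukey's HSD, using statsmodels. The data frame and response column
# 'germination' are hypothetical stand-ins for the study's measurements.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# columns assumed: genotype, delay_h, temp_c, germination
df = pd.read_csv("maize_quality.csv")

# 4 x 6 x 2 factorial model: genotype x drying delay x drying temperature.
model = ols("germination ~ C(genotype) * C(delay_h) * C(temp_c)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F tests at p < 0.05

# Tukey's test for pairwise genotype comparisons.
print(pairwise_tukeyhsd(df["germination"], df["genotype"], alpha=0.05))
```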


2020 ◽  
Author(s):  
Leonardo Guerreiro Azevedo ◽  
Renan Souza ◽  
Raphael Melo Thiago ◽  
Elton Soares ◽  
Marcio Moreno

Machine Learning (ML) is a core concept behind Artificial Intelligence systems, which are driven by data and generate ML models. These models are used for decision making, and it is crucial to trust their outputs, e.g., by understanding the process that derives them. One way to explain the derivation of ML models is to track the whole ML lifecycle and generate its data lineage, which may be accomplished with provenance data management techniques. In this work, we present the use of the ProvLake tool for ML provenance data management across the ML lifecycle for Well Top Picking, an essential process in Oil and Gas exploration. We show how ProvLake supported the validation of the ML models, the understanding of whether they generalize in accordance with the domain characteristics, and the explanation of their derivation.
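The abstract does not show ProvLake's interface, so the sketch below illustrates only the underlying idea: recording the inputs, outputs, and timing of each ML lifecycle step so that a model's derivation can be traced afterwards. All names here are hypothetical and do not reflect ProvLake's actual API.

```python
# Minimal, hypothetical sketch of provenance capture for ML lifecycle steps.
# Each step appends a record of its inputs, outputs, and timing, so a trained
# model can be traced back to the data and parameters that produced it.
# This does not reflect ProvLake's actual API.
import json
import time
from functools import wraps

PROVENANCE_LOG = "provenance.jsonl"

def track_provenance(step_name):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            record = {
                "step": step_name,
                "inputs": {"args": [repr(a) for a in args],
                           "kwargs": {k: repr(v) for k, v in kwargs.items()}},
                "output": repr(result),
                "duration_s": round(time.time() - start, 3),
            }
            with open(PROVENANCE_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator

@track_provenance("train_model")
def train_model(dataset_path, learning_rate=0.01):
    # ... training elided; return a model identifier.
    return "model-v1"

train_model("wells.csv", learning_rate=0.005)
```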


First Break ◽  
2020 ◽  
Vol 38 (7) ◽  
pp. 89-93
Author(s):  
Leonardo Guerreiro Azevedo ◽  
Renan Souza ◽  
Rafael Brandão ◽  
Vítor N. Lourenço ◽  
Marcelo Costalonga ◽  
...  

2020 ◽  
Author(s):  
David Schäfer ◽  
Bert Palm ◽  
Lennart Schmidt ◽  
Peter Lünenschloß ◽  
Jan Bumberger

The number of sensors used in the environmental system sciences is increasing rapidly, and while this trend undoubtedly provides great potential to broaden the understanding of complex spatio-temporal processes, it comes with its own set of new challenges. The flow of data from a source to its sink, from sensors to databases, involves many, usually error-prone intermediate steps. From data acquisition with its specific scientific and technical challenges, over the transfer of data from often remote locations, to the final data processing, all steps carry great potential to introduce errors and disturbances into the actual environmental signal.

Quantifying these errors becomes a crucial part of the later evaluation of all measured data. While many large environmental observatories are moving from manual to more automated ways of data processing and quality assurance, these systems are usually highly customized and hand-written. This approach is non-ideal in several ways: first, it wastes resources, as the same algorithms are implemented over and over again; second, it imposes great challenges on reproducibility. If the relevant programs are made available at all, they expose all the problems of software reuse: correctness of the implementation, readability and comprehensibility for future users, and transferability between different computing environments. Besides these general software development problems, another crucial factor comes into play: the end product, a processed and quality-controlled data set, is closely tied to the current version of the programs in use. Even small changes to the source code can lead to vastly differing results. If this is not approached responsibly, data and programs will inevitably fall out of sync.

The presented software, the System for automated Quality Control (SaQC) (www.ufz.git.de/rdm-software/saqc), helps to either solve or massively simplify the solution to these challenges. As a mainly no-code platform with a large set of implemented functionality, SaQC lowers the entry barrier for the non-programming scientific practitioner without sacrificing fine-grained adaptation to project-specific needs. Its text-based configuration allows easy integration into version control systems and thus opens the opportunity to use well-established software for data lineage. We give a short overview of the program's unique features and showcase possibilities to build reliable and reproducible processing and quality assurance pipelines for real-world data from a spatially distributed, heterogeneous sensor network.
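As a rough sketch of such a pipeline through SaQC's Python API (method names follow the SaQC documentation; the variable name "temp", the file path, and the thresholds are illustrative assumptions):

```python
# Rough sketch of a SaQC quality-control pipeline via its Python API.
# The variable name 'temp', file path, and thresholds are assumptions.
import pandas as pd
from saqc import SaQC

data = pd.read_csv("sensor_data.csv", index_col=0, parse_dates=True)

qc = SaQC(data=data)
qc = qc.flagMissing("temp")                 # flag gaps in the series
qc = qc.flagRange("temp", min=-30, max=60)  # flag physically implausible values

# Processed data and flags can then be persisted; because the pipeline is
# plain text/code, it versions cleanly alongside the data for lineage.
print(qc.flags["temp"].head())
```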


2020 ◽  
Author(s):  
R.M. Thiago ◽  
R. Souza ◽  
L. Azevedo ◽  
E. Figueiredo De Souza Soares ◽  
R. Santos ◽  
...  
