data lineage
Recently Published Documents


TOTAL DOCUMENTS

47
(FIVE YEARS 15)

H-INDEX

7
(FIVE YEARS 1)

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Paul Billing Ross ◽  
Jina Song ◽  
Philip S. Tsao ◽  
Cuiping Pan

Abstract: Biomedical studies have grown in size and yield large quantities of data, yet efficient data processing remains a challenge. Here we present Trellis, a cloud-based data and task management framework that completely automates the process from data ingestion to result presentation, while tracking data lineage, facilitating information queries, and supporting fault tolerance and scalability. Using a graph database to coordinate the state of the data processing workflows and a scalable microservice architecture to perform bioinformatics tasks, Trellis has enabled efficient variant calling on 100,000 human genomes collected in the VA Million Veteran Program.
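The coordination pattern described above (data objects and tasks stored as nodes in a graph database, so that lineage queries become graph traversals) can be sketched in a few lines. The snippet below is a minimal illustration using the Neo4j Python driver; the node labels, URIs, and credentials are hypothetical and do not reflect Trellis's actual schema.

```python
# Hypothetical sketch: recording workflow state and lineage in a graph
# database, in the spirit of Trellis. Node labels and properties are
# illustrative, not Trellis's actual schema.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def register_output(tx, input_uri, task_name, output_uri):
    # Each data object is a node; tasks link inputs to outputs,
    # so lineage queries become simple graph traversals.
    tx.run(
        "MERGE (i:Blob {uri: $input_uri}) "
        "MERGE (o:Blob {uri: $output_uri}) "
        "MERGE (i)-[:WAS_INPUT_TO]->(t:Task {name: $task_name, status: 'done'}) "
        "MERGE (t)-[:GENERATED]->(o)",
        input_uri=input_uri, task_name=task_name, output_uri=output_uri,
    )

with driver.session() as session:
    session.execute_write(register_output,
                          "gs://bucket/sample.fastq", "variant-calling",
                          "gs://bucket/sample.vcf")
driver.close()
```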


Author(s):  
Alexander Schoenenwald ◽  
Simon Kern ◽  
Josef Viehhauser ◽  
Johannes Schildgen

Abstract: Metadata management constitutes a key prerequisite for enterprises as they engage in data analytics and governance. Today, however, the context of data is often only manually documented by subject matter experts, and lacks completeness and reliability due to the complex nature of data pipelines. Thus, collecting data lineage (describing the origin, structure, and dependencies of data) in an automated fashion increases the quality of the provided metadata and reduces manual effort, making it critical for the development and operation of data pipelines. In our practice report, we propose an end-to-end solution that digests lineage via (Py)Spark execution plans. We build upon the open-source component Spline, allowing us to reliably consume lineage metadata and identify interdependencies. We map the digested data into an expandable data model, enabling us to extract graph structures for both coarse- and fine-grained data lineage. Lastly, our solution visualizes the extracted data lineage via a modern web app, and integrates with BMW Group's soon-to-be open-sourced Cloud Data Hub.
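As a rough sketch of the setup the report describes, the following PySpark fragment attaches Spline's query-execution listener so that execution plans are harvested as lineage events. The listener class and producer URL follow Spline's public documentation; the file paths and column names are illustrative, and the Spline agent JAR is assumed to be on the Spark classpath.

```python
# Sketch: a PySpark job with the Spline agent attached, so that execution
# plans are emitted as lineage events (the agent bundle JAR must be on the
# classpath, e.g. via --packages). Paths and columns are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lineage-demo")
    # Spline hooks into Spark via a query-execution listener.
    .config("spark.sql.queryExecutionListeners",
            "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener")
    .config("spark.spline.producer.url", "http://localhost:8080/producer")
    .getOrCreate()
)

df = spark.read.csv("input/orders.csv", header=True)
result = df.groupBy("customer_id").count()

# The execution plan that lineage is derived from can also be inspected:
result.explain(extended=True)

result.write.mode("overwrite").parquet("output/orders_by_customer")
```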


2021 ◽  
pp. 5-19
Author(s):  
Jens Freche ◽  
Milan den Heijer ◽  
Bastian Wormuth

2021 ◽  
Vol 43 ◽  
Author(s):  
Júlia Camargos da Costa ◽  
Everson Reis Carvalho ◽  
Izabel Costa Silva Neta ◽  
Milena Christy Santos ◽  
Luciano Dias Cabral Neto ◽  
...  

Abstract: This study aimed to evaluate the effect of the genetic composition and the arrangement between female and male parents on tolerance to delayed drying of maize seeds, assessing physiological quality and enzyme expression. Ears were harvested close to physiological maturity (around 35% moisture), and the genotypes were identified as line 1 (L1), line 2 (L2), the hybrid (HB, ♀L1 × ♂L2), and the reciprocal hybrid (HR, ♀L2 × ♂L1). Physiological quality was assessed in a completely randomized design in a 4 × 6 × 2 factorial arrangement: four genotypes, six delay periods before artificial drying (10, 18, 24, 28, 32, and 40 hours), and two drying temperatures (42 and 48 °C). Enzyme expression was assessed in a completely randomized design in a 4 × 3 factorial arrangement: four genotypes and three delay periods before artificial drying (10, 24, and 40 hours) at 48 °C. The data were subjected to analysis of variance (F test, p < 0.05), Tukey's test (p < 0.05), and polynomial regression analysis. The parental arrangement affects seed tolerance to drying delay; therefore, susceptible lines should not be used as female parents. Seeds of the line most susceptible to drying delay (L2) exhibit lower expression of α-amylase (α-AM).
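For illustration, the factorial analysis described in the abstract (ANOVA F test followed by Tukey's test) could be reproduced along these lines with statsmodels; the input file and the response column "germination" are hypothetical stand-ins for the study's measurements.

```python
# Sketch of the described analysis: a factorial ANOVA (F test) followed by
# Tukey's HSD, using statsmodels. The data frame and response column
# 'germination' are hypothetical stand-ins for the study's measurements.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# columns assumed: genotype, delay_h, temp_c, germination
df = pd.read_csv("maize_quality.csv")

# 4 x 6 x 2 factorial model: genotype x drying delay x drying temperature.
model = ols("germination ~ C(genotype) * C(delay_h) * C(temp_c)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # F tests at p < 0.05

# Tukey's test for pairwise genotype comparisons.
print(pairwise_tukeyhsd(df["germination"], df["genotype"], alpha=0.05))
```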


2020 ◽  
Author(s):  
Leonardo Guerreiro Azevedo ◽  
Renan Souza ◽  
Raphael Melo Thiago ◽  
Elton Soares ◽  
Marcio Moreno

Machine Learning (ML) is a core concept behind Artificial Intelligence systems, which are driven by data and generate ML models. These models are used for decision making, and it is crucial to trust their outputs, e.g., by understanding the process that derives them. One way to explain the derivation of ML models is to track the whole ML lifecycle and generate its data lineage, which may be accomplished with provenance data management techniques. In this work, we present the use of the ProvLake tool for ML provenance data management across the ML lifecycle for Well Top Picking, an essential process in Oil and Gas exploration. We show how ProvLake supported the validation of the ML models, the understanding of whether they generalize in accordance with the domain characteristics, and the explanation of their derivation.
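The abstract does not show ProvLake's interface, so the sketch below illustrates only the underlying idea: recording the inputs, outputs, and timing of each ML lifecycle step so that a model's derivation can be traced afterwards. All names here are hypothetical and do not reflect ProvLake's actual API.

```python
# Minimal, hypothetical sketch of provenance capture for ML lifecycle steps.
# Each step appends a record of its inputs, outputs, and timing, so a trained
# model can be traced back to the data and parameters that produced it.
# This does not reflect ProvLake's actual API.
import json
import time
from functools import wraps

PROVENANCE_LOG = "provenance.jsonl"

def track_provenance(step_name):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            result = func(*args, **kwargs)
            record = {
                "step": step_name,
                "inputs": {"args": [repr(a) for a in args],
                           "kwargs": {k: repr(v) for k, v in kwargs.items()}},
                "output": repr(result),
                "duration_s": round(time.time() - start, 3),
            }
            with open(PROVENANCE_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator

@track_provenance("train_model")
def train_model(dataset_path, learning_rate=0.01):
    # ... training elided; return a model identifier.
    return "model-v1"

train_model("wells.csv", learning_rate=0.005)
```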


First Break ◽  
2020 ◽  
Vol 38 (7) ◽  
pp. 89-93
Author(s):  
Leonardo Guerreiro Azevedo ◽  
Renan Souza ◽  
Rafael Brandão ◽  
Vítor N. Lourenço ◽  
Marcelo Costalonga ◽  
...  

2020 ◽  
Author(s):  
David Schäfer ◽  
Bert Palm ◽  
Lennart Schmidt ◽  
Peter Lünenschloß ◽  
Jan Bumberger

The number of sensors used in the environmental system sciences is increasing rapidly, and while this trend undoubtedly provides great potential to broaden the understanding of complex spatio-temporal processes, it comes with its own set of new challenges. The flow of data from a source to its sink, from sensors to databases, involves many, usually error-prone intermediate steps. From data acquisition with its specific scientific and technical challenges, over the transfer of data from often remote locations, to the final data processing, all steps carry great potential to introduce errors and disturbances into the actual environmental signal.

Quantifying these errors becomes a crucial part of the later evaluation of all measured data. While many large environmental observatories are moving from manual to more automated ways of data processing and quality assurance, these systems are usually highly customized and hand-written. This approach is non-ideal in several ways: first, it wastes resources, as the same algorithms are implemented over and over again; second, it imposes great challenges on reproducibility. If the relevant programs are made available at all, they expose all the problems of software reuse: correctness of the implementation, readability and comprehensibility for future users, and transferability between different computing environments. Besides these general software development problems, another crucial factor comes into play: the end product, a processed and quality-controlled data set, is closely tied to the current version of the programs in use. Even small changes to the source code can lead to vastly differing results. If this is not approached responsibly, data and programs will inevitably fall out of sync.

The presented software, the System for automated Quality Control (SaQC) (www.ufz.git.de/rdm-software/saqc), helps to either solve or massively simplify the solution to these challenges. As a mainly no-code platform with a large set of implemented functionality, SaQC lowers the entry barrier for the non-programming scientific practitioner without sacrificing fine-grained adaptation to project-specific needs. Its text-based configuration allows easy integration into version control systems and thus opens the opportunity to use well-established software for data lineage. We give a short overview of the program's unique features and showcase possibilities to build reliable and reproducible processing and quality assurance pipelines for real-world data from a spatially distributed, heterogeneous sensor network.
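As a rough sketch of such a pipeline through SaQC's Python API (method names follow the SaQC documentation; the variable name "temp", the file path, and the thresholds are illustrative assumptions):

```python
# Rough sketch of a SaQC quality-control pipeline via its Python API.
# The variable name 'temp', file path, and thresholds are assumptions.
import pandas as pd
from saqc import SaQC

data = pd.read_csv("sensor_data.csv", index_col=0, parse_dates=True)

qc = SaQC(data=data)
qc = qc.flagMissing("temp")                 # flag gaps in the series
qc = qc.flagRange("temp", min=-30, max=60)  # flag physically implausible values

# Processed data and flags can then be persisted; because the pipeline is
# plain text/code, it versions cleanly alongside the data for lineage.
print(qc.flags["temp"].head())
```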


2020 ◽  
Author(s):  
R.M. Thiago ◽  
R. Souza ◽  
L. Azevedo ◽  
E. Figueiredo De Souza Soares ◽  
R. Santos ◽  
...  
