Leveraging data lineage to infer logical relationships between astronomical catalogs

2012 ◽ Vol 35 (1-2) ◽ pp. 227-244
Author(s): Hugo Buddelmeijer ◽ Edwin A. Valentijn

2011 ◽ Vol 35 (1-2) ◽ pp. 283-300
Author(s): Hugo Buddelmeijer ◽ Edwin A. Valentijn

2021 ◽ pp. 5-19
Author(s): Jens Freche ◽ Milan den Heijer ◽ Bastian Wormuth

2012 ◽ Vol 35 (1-2) ◽ pp. 203-225
Author(s): Hugo Buddelmeijer ◽ Danny Boxhoorn ◽ Edwin A. Valentijn

2020
Author(s): David Schäfer ◽ Bert Palm ◽ Lennart Schmidt ◽ Peter Lünenschloß ◽ Jan Bumberger

The number of sensors used in the environmental system sciences is increasing rapidly. While this trend undoubtedly offers great potential to broaden the understanding of complex spatio-temporal processes, it comes with its own set of new challenges. The flow of data from source to sink, from sensors to databases, involves many, usually error-prone, intermediate steps: the data acquisition with its specific scientific and technical challenges, the data transfer from often remote locations, and the final data processing all carry great potential to introduce errors and disturbances into the actual environmental signal.

Quantifying these errors becomes a crucial part of the later evaluation of all measured data. While many large environmental observatories are moving from manual to more automated ways of data processing and quality assurance, these systems are usually highly customized and hand-written. This approach is non-ideal in several ways: first, it wastes resources, as the same algorithms are implemented over and over again; second, it imposes great challenges to reproducibility. If the relevant programs are made available at all, they expose all the problems of software reuse: correctness of the implementation, readability and comprehensibility for future users, and transferability between different computing environments. Besides these general software-development problems, another crucial factor comes into play: the end product, a processed and quality-controlled data set, is closely tied to the current version of the programs in use. Even small changes to the source code can lead to vastly differing results. If this is not approached responsibly, data and programs will inevitably fall out of sync.

The presented software, the 'System for automated Quality Control' (SaQC, www.ufz.git.de/rdm-software/saqc), helps to solve, or at least massively simplify, these challenges. As a mainly no-code platform with a large set of built-in functionality, SaQC lowers the entry barrier for the non-programming scientific practitioner without sacrificing fine-grained adaptation to project-specific needs. The text-based configuration allows easy integration into version control systems and thus opens the opportunity to use well-established software for data lineage. We will give a short overview of the program's unique features and showcase possibilities to build reliable and reproducible processing and quality-assurance pipelines for real-world data from a spatially distributed, heterogeneous sensor network.
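To make the configuration idea concrete, below is a minimal sketch, in Python, of a text-configuration-driven quality-control pipeline in the spirit the abstract describes. This is a generic illustration, not SaQC's actual API: the "&lt;variable&gt; ; &lt;test&gt;" syntax, the functions flag_range and flag_missing, and the run_pipeline helper are all assumptions made for this example; SaQC's documentation defines the real configuration format and function set.

```python
import pandas as pd

def flag_range(series: pd.Series, min: float, max: float) -> pd.Series:
    """Flag values outside a plausible physical range."""
    return (series < min) | (series > max)

def flag_missing(series: pd.Series) -> pd.Series:
    """Flag gaps in the record."""
    return series.isna()

# A hypothetical plain-text configuration, one "<variable> ; <test>" per
# line, as it might be committed to a git repository next to the data:
CONFIG = """
soil_moisture ; flag_range(min=0, max=100)
soil_moisture ; flag_missing()
air_temp      ; flag_range(min=-40, max=60)
"""

REGISTRY = {"flag_range": flag_range, "flag_missing": flag_missing}

def run_pipeline(data: pd.DataFrame, config: str) -> pd.DataFrame:
    """Apply every configured test; OR the resulting flags per variable."""
    flags = pd.DataFrame(False, index=data.index, columns=data.columns)
    for line in config.strip().splitlines():
        var, call = (part.strip() for part in line.split(";"))
        name, _, argtext = call.partition("(")
        kwargs = {}
        for pair in filter(None, argtext.rstrip(")").split(",")):
            key, value = pair.split("=")
            kwargs[key.strip()] = float(value)
        flags[var] |= REGISTRY[name](data[var], **kwargs)
    return flags
```

Calling run_pipeline(df, CONFIG) returns a boolean flag table aligned with the data. The point of the design is that CONFIG is plain text: committed to version control next to the raw data, it lets every released, quality-controlled data set be traced back to the exact set of tests that produced its flags.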


2020
Author(s): Leonardo Guerreiro Azevedo ◽ Renan Souza ◽ Raphael Melo Thiago ◽ Elton Soares ◽ Marcio Moreno

Machine Learning (ML) is a core concept behind Artificial Intelligence systems, which are driven by data and generate ML models. These models are used for decision making, and it is crucial to trust their outputs, e.g., by understanding the process that derives them. One way to explain the derivation of ML models is to track the whole ML lifecycle, generating its data lineage, which may be accomplished with provenance data management techniques. In this work, we present the use of the ProvLake tool for ML provenance data management in the ML lifecycle for Well Top Picking, an essential process in Oil and Gas exploration. We show how ProvLake supported the validation of ML models, the understanding of whether the models generalize in accordance with the domain characteristics, and the explanation of their derivation.
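As an illustration of the kind of lineage capture the abstract describes, here is a small generic sketch of recording provenance around one step of an ML lifecycle. It is not ProvLake's API: the ProvenanceRecord layout, the run_tracked helper, and the "train_well_top_model" task name are hypothetical, chosen only to show how a model artifact can be linked back to the exact inputs that derived it.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class ProvenanceRecord:
    """Lineage of a single workflow task: what went in, what came out."""
    task: str
    inputs: dict                       # e.g. dataset ids, hyperparameters
    outputs: dict = field(default_factory=dict)
    started: float = field(default_factory=time.time)
    finished: float = 0.0

def content_id(payload: bytes) -> str:
    """Stable identifier so derived artifacts can be linked to sources."""
    return hashlib.sha256(payload).hexdigest()[:12]

def run_tracked(store: list, task: str, inputs: dict, fn):
    """Run fn(**inputs), recording lineage before and after execution."""
    rec = ProvenanceRecord(task=task, inputs=inputs)
    result = fn(**inputs)
    rec.outputs = result
    rec.finished = time.time()
    store.append(asdict(rec))
    return result

# Usage: a toy "training" step whose provenance record ties the model
# back to the exact data and hyperparameters that produced it.
store = []
model = run_tracked(
    store,
    task="train_well_top_model",       # hypothetical task name
    inputs={"dataset_id": content_id(b"logs.csv"), "epochs": 5},
    fn=lambda dataset_id, epochs: {"model_id": f"m-{dataset_id}-{epochs}"},
)
print(json.dumps(store, indent=2))
```

Querying such records answers exactly the questions the abstract raises: which data and parameters a given model was derived from, and hence whether its training respected the characteristics of the domain.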


1986 ◽ Vol 118 ◽ pp. 321-322
Author(s): Wayne H. Warren

The development of computer-controlled telescopes at small observatories has dramatically increased the demand for, and the potential usefulness of, astronomical catalogs in machine-readable form. The compilation and storage of catalogs containing program and standard stars are obvious necessities for the operation of an automatic telescope, but to date most observers have been collecting their own data and manually entering them into microcomputer disk storage. (This is clear from the small number of machine catalogs distributed by the ADC to smaller observatories.) Astronomical data centers located in several countries around the world currently archive, maintain, and disseminate a wide variety of machine catalogs in virtually every discipline of astronomy. These facilities can provide observers with nearly any kind of data needed for controlling telescopes (positional catalogs), reducing data (catalogs of all types of photometry, spectroscopy, etc.), and interpreting observations (catalogs of binaries, variables, radial and rotational velocities, etc.). The ADC presently has approximately 450 machine catalogs in its archives, and these are available to observatories upon request. Procedures for obtaining data from the ADC and policies for distribution are described in this paper; a list of all available catalogs can be obtained by contacting the ADC.

