Datenbank-Spektrum
Latest Publications

TOTAL DOCUMENTS: 378 (five years: 94)
H-INDEX: 12 (five years: 2)

Published by Springer-Verlag
ISSN: 1610-1995, 1618-2162

Author(s):  
Lucas Woltmann ◽  
Claudio Hartmann ◽  
Dirk Habich ◽  
Wolfgang Lehner

Abstract: Cardinality estimation is a fundamental task in database query processing and optimization. As recent papers have shown, machine learning (ML)-based approaches can deliver more accurate cardinality estimates than traditional approaches. However, a large number of training queries has to be executed during the model training phase to learn a data-dependent ML model, which makes this phase very time-consuming. Many of these training or example queries use the same base data, have the same query structure, and differ only in their selection predicates. To speed up the model training phase, our core idea is to compute a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a dedicated aggregate-based training phase for ML-based cardinality estimation approaches. As we show for different workloads in our evaluation, our aggregate-based training phase achieves an average speedup of 90 and thus outperforms indexes.
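
A minimal sketch of the pre-aggregation idea, under assumed column names and a COUNT-based workload (not the authors' implementation): the base table is grouped once over the predicate columns, and each training query's cardinality is then answered from the much smaller aggregate instead of the base data.

import pandas as pd

# Hypothetical base table with two predicate columns.
base = pd.DataFrame({
    "country": ["DE", "DE", "US", "US", "US", "FR"],
    "year":    [2020, 2021, 2020, 2020, 2021, 2021],
})

# Predicate-independent pre-aggregation: count rows per combination
# of the predicate columns, computed only once up front.
agg = base.groupby(["country", "year"]).size().reset_index(name="cnt")

def cardinality(country: str, year: int) -> int:
    """Answer one training query's cardinality from the aggregate
    instead of scanning the base table."""
    hit = agg[(agg["country"] == country) & (agg["year"] == year)]
    return int(hit["cnt"].sum())

# Each (predicate values, cardinality) pair becomes a training example
# for the ML-based estimator.
print(cardinality("US", 2020))  # 2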


Author(s):  
Meike Klettke ◽  
Uta Störl

Abstract: Data-driven methods and data science are important scientific methods in many research fields. All data science approaches require professional data engineering components. At the moment, computer science experts are needed to solve these data engineering tasks. At the same time, scientists from many fields (such as the natural sciences, medicine, environmental sciences, and engineering) want to analyse their data autonomously. The resulting task for data engineering is the development of tools that support automated data curation and are usable by domain experts. In this article, we introduce four generations of data engineering approaches that classify the data engineering technologies of the past and present. We show which data engineering tools are needed for the scientific landscape of the next decade.


Author(s):  
Lucas Woltmann ◽  
Peter Volk ◽  
Michael Dinzinger ◽  
Lukas Gräf ◽  
Sebastian Strasser ◽  
...  

Abstract: For its third installment, the Data Science Challenge of the 19th symposium “Database Systems for Business, Technology and Web” (BTW) of the Gesellschaft für Informatik (GI) tackled the problem of predictive energy management in large production facilities. For the first time, this year’s challenge was organized as a cooperation between Technische Universität Dresden, GlobalFoundries, and ScaDS.AI Dresden/Leipzig. The participants were given real-world production and energy data from the semiconductor manufacturer GlobalFoundries and had to predict the energy consumption of production equipment. Working with real-world data gave the participants hands-on experience with the challenges of Big Data integration and analysis. After a leaderboard-based preselection round, the accepted participants presented their approaches to an expert jury and audience in a hybrid format. In this article, we give an overview of the main aspects of the Data Science Challenge, such as its organization and the problem description. In addition, the winning team presents its solution.


Author(s):  
Ulf Leser ◽  
Marcus Hilbrich ◽  
Claudia Draxl ◽  
Peter Eisert ◽  
Lars Grunske ◽  
...  

Author(s):  
Ioannis Prapas ◽  
Behrouz Derakhshan ◽  
Alireza Rezaei Mahdiraji ◽  
Volker Markl

Abstract: Deep Learning (DL) has consistently surpassed other machine learning methods and achieved state-of-the-art performance in many cases. Several modern applications, such as financial and recommender systems, require models that are constantly updated with fresh data. The prevailing approach to keeping a DL model fresh is to trigger full retraining from scratch once enough new data are available. However, retraining large and complex DL models is time-consuming and compute-intensive, which makes full retraining costly, wasteful, and slow. In this paper, we present an approach to continuously train and deploy DL models. First, we enable continuous training through proactive training, which combines samples of historical data with new streaming data. Second, we enable continuous deployment through gradient sparsification, which allows us to send only a small percentage of the model updates per training iteration. Our experimental results with LeNet5 on MNIST and modern DL models on CIFAR-10 show that proactive training keeps models fresh with comparable, if not superior, performance to full retraining at a fraction of the time. Combined with gradient sparsification, sparse proactive training enables very fast updates of a deployed model with arbitrarily large sparsity, reducing communication per iteration by up to four orders of magnitude, with minimal, if any, loss in model quality. Sparse training, however, comes at a price: it incurs a training overhead that depends on the size of the model and increases training time by factors between 1.25 and 3 in our experiments. Arguably, this is a small price to pay for enabling the continuous training and deployment of large DL models.
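
A minimal sketch of gradient sparsification via top-k magnitude selection, a common realization of the idea (the function names and the keep ratio are illustrative assumptions, not the paper's code): only the largest-magnitude gradient entries are transmitted per iteration, and the deployed model is patched with this sparse update.

import numpy as np

def sparsify_gradient(grad, keep_ratio=0.01):
    """Keep only the largest-magnitude entries of a flattened gradient,
    so that only a small percentage of updates is sent per iteration."""
    flat = grad.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def apply_sparse_update(params, idx, values, lr=0.1):
    """Apply a received sparse update to the deployed model's
    (flattened) parameters in place."""
    params[idx] -= lr * values
    return params

# Toy usage: a gradient with one million entries, of which 1% is transmitted.
rng = np.random.default_rng(0)
grad = rng.normal(size=1_000_000)
params = np.zeros_like(grad)
idx, values = sparsify_gradient(grad, keep_ratio=0.01)
apply_sparse_update(params, idx, values)
print(len(idx))  # 10000 entries sent instead of 1000000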


Author(s):  
Ralf Schenkel ◽  
Stefanie Scherzinger ◽  
Marina Tropmann-Frick

Abstract: The special issue on "Data Engineering for Data Science" gives us occasion to assess the role of this topic in academic database teaching by means of a small survey. In this article, we report the collected results. We received 17 responses from the GI-Fachgruppe Datenbanksysteme. Compared to an earlier survey on teaching in the area of "Cloud", presented in the Datenbank-Spektrum in 2014, the results indicate that data engineering content is increasingly taught in undergraduate courses as well as outside of core computer science. Data engineering appears to be establishing itself as a cross-cutting topic that is no longer reserved for master's programs.


Author(s):  
Meike Klettke ◽  
Adrian Lutsch ◽  
Uta Störl

Abstract: Data engineering is an integral part of any data science and ML process. It consists of several subtasks that are performed to improve data quality and to transform data into a target format suitable for analysis. The quality and correctness of the data engineering steps are therefore important for the quality of the overall process. In machine learning processes, requirements such as fairness and explainability are essential, and they must also be met by the data engineering subtasks. In this article, we show how this can be achieved by logging, monitoring, and controlling data changes in order to evaluate their correctness. Since data preprocessing algorithms are part of every machine learning pipeline, they must also guarantee that they do not introduce data biases. We briefly introduce three classes of methods for measuring data changes in data engineering and discuss which research questions remain open in this area.
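
A minimal sketch of logging and monitoring data changes around one preprocessing step (column names, statistics, and the imputation step are illustrative assumptions, not one of the article's three method classes specifically): per-column profiles are recorded before and after the step so its effect on the data can be audited later.

import pandas as pd

def profile(df):
    """Collect simple per-column statistics used to monitor data changes."""
    return {
        col: {
            "rows": int(df[col].count()),
            "nulls": int(df[col].isna().sum()),
            "mean": float(df[col].mean()) if pd.api.types.is_numeric_dtype(df[col]) else None,
        }
        for col in df.columns
    }

def log_change(before, after, step_name):
    """Record how a preprocessing step changed the data, so the step can be audited."""
    report = {col: {"before": before[col], "after": after.get(col)} for col in before}
    print(f"[data-change] step={step_name}: {report}")

# Toy usage: an imputation step whose effect on the 'age' column is logged.
df = pd.DataFrame({"age": [25, None, 40, 35], "group": ["a", "b", "a", "b"]})
stats_before = profile(df)
df["age"] = df["age"].fillna(df["age"].mean())
log_change(stats_before, profile(df), step_name="impute_age")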


Author(s):  
Ralf Schenkel ◽  
Stefanie Scherzinger ◽  
Marina Tropmann-Frick ◽  
Theo Härder
