Datenbank-Spektrum
Latest Publications

TOTAL DOCUMENTS: 378 (five years: 94)
H-INDEX: 12 (five years: 2)

Published by Springer-Verlag
ISSN: 1610-1995, 1618-2162

Author(s):  
Lucas Woltmann ◽  
Claudio Hartmann ◽  
Dirk Habich ◽  
Wolfgang Lehner

Abstract: Cardinality estimation is a fundamental task in database query processing and optimization. As recent papers have shown, machine learning (ML)-based approaches can deliver more accurate cardinality estimates than traditional approaches. However, a large number of training queries has to be executed during the model training phase to learn a data-dependent ML model, which makes this phase very time-consuming. Many of these training or example queries use the same base data, have the same query structure, and differ only in their selection predicates. To speed up the model training phase, our core idea is to compute a predicate-independent pre-aggregation of the base data and to execute the example queries over this pre-aggregated data. Based on this idea, we present a dedicated aggregate-based training phase for ML-based cardinality estimation approaches. As we show for different workloads in our evaluation, our aggregate-based training phase achieves an average speedup of 90 and thus outperforms indexes.
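
A minimal sketch of the pre-aggregation idea, under assumed column names and a COUNT-based workload (not the authors' implementation): the base table is grouped once over the predicate columns, and each training query's cardinality is then answered from the much smaller aggregate instead of the base data.

import pandas as pd

# Hypothetical base table with two predicate columns.
base = pd.DataFrame({
    "country": ["DE", "DE", "US", "US", "US", "FR"],
    "year":    [2020, 2021, 2020, 2020, 2021, 2021],
})

# Predicate-independent pre-aggregation: count rows per combination
# of the predicate columns, computed only once up front.
agg = base.groupby(["country", "year"]).size().reset_index(name="cnt")

def cardinality(country: str, year: int) -> int:
    """Answer one training query's cardinality from the aggregate
    instead of scanning the base table."""
    hit = agg[(agg["country"] == country) & (agg["year"] == year)]
    return int(hit["cnt"].sum())

# Each (predicate values, cardinality) pair becomes a training example
# for the ML-based estimator.
print(cardinality("US", 2020))  # 2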


Author(s):  
Meike Klettke ◽  
Uta Störl

Abstract: Data-driven methods and data science are important scientific methods in many research fields. All data science approaches require professional data engineering components. At the moment, computer science experts are needed to solve these data engineering tasks. At the same time, scientists from many fields (such as the natural sciences, medicine, environmental sciences, and engineering) want to analyse their data autonomously. The resulting task for data engineering is the development of tools that support automated data curation and are usable by domain experts. In this article, we introduce four generations of data engineering approaches that classify the data engineering technologies of the past and present. We show which data engineering tools are needed for the scientific landscape of the next decade.


Author(s):  
Lucas Woltmann ◽  
Peter Volk ◽  
Michael Dinzinger ◽  
Lukas Gräf ◽  
Sebastian Strasser ◽  
...  

Abstract: For its third installment, the Data Science Challenge of the 19th symposium “Database Systems for Business, Technology and Web” (BTW) of the Gesellschaft für Informatik (GI) tackled the problem of predictive energy management in large production facilities. For the first time, this year’s challenge was organized as a cooperation between Technische Universität Dresden, GlobalFoundries, and ScaDS.AI Dresden/Leipzig. The participants were given real-world production and energy data from the semiconductor manufacturer GlobalFoundries and had to predict the energy consumption of production equipment. Working with real-world data gave the participants hands-on experience with the challenges of Big Data integration and analysis. After a leaderboard-based preselection round, the accepted participants presented their approaches to an expert jury and audience in a hybrid format. In this article, we give an overview of the main aspects of the Data Science Challenge, such as its organization and the problem description. In addition, the winning team presents its solution.


Author(s):  
Ulf Leser ◽  
Marcus Hilbrich ◽  
Claudia Draxl ◽  
Peter Eisert ◽  
Lars Grunske ◽  
...  

Author(s):  
Ioannis Prapas ◽  
Behrouz Derakhshan ◽  
Alireza Rezaei Mahdiraji ◽  
Volker Markl

Abstract: Deep Learning (DL) has consistently surpassed other machine learning methods and achieved state-of-the-art performance in many cases. Several modern applications, such as financial and recommender systems, require models that are constantly updated with fresh data. The prevailing approach to keeping a DL model fresh is to trigger full retraining from scratch once enough new data are available. However, retraining large and complex DL models is time-consuming and compute-intensive, which makes full retraining costly, wasteful, and slow. In this paper, we present an approach to continuously train and deploy DL models. First, we enable continuous training through proactive training, which combines samples of historical data with new streaming data. Second, we enable continuous deployment through gradient sparsification, which allows us to send only a small percentage of the model updates per training iteration. Our experimental results with LeNet5 on MNIST and modern DL models on CIFAR-10 show that proactive training keeps models fresh with comparable, if not superior, performance to full retraining at a fraction of the time. Combined with gradient sparsification, sparse proactive training enables very fast updates of a deployed model with arbitrarily large sparsity, reducing communication per iteration by up to four orders of magnitude, with minimal, if any, loss in model quality. Sparse training, however, comes at a price: it incurs a training overhead that depends on the size of the model and increases training time by factors between 1.25 and 3 in our experiments. Arguably, this is a small price to pay for enabling the continuous training and deployment of large DL models.
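
A minimal sketch of gradient sparsification via top-k magnitude selection, a common realization of the idea (the function names and the keep ratio are illustrative assumptions, not the paper's code): only the largest-magnitude gradient entries are transmitted per iteration, and the deployed model is patched with this sparse update.

import numpy as np

def sparsify_gradient(grad, keep_ratio=0.01):
    """Keep only the largest-magnitude entries of a flattened gradient,
    so that only a small percentage of updates is sent per iteration."""
    flat = grad.ravel()
    k = max(1, int(keep_ratio * flat.size))
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def apply_sparse_update(params, idx, values, lr=0.1):
    """Apply a received sparse update to the deployed model's
    (flattened) parameters in place."""
    params[idx] -= lr * values
    return params

# Toy usage: a gradient with one million entries, of which 1% is transmitted.
rng = np.random.default_rng(0)
grad = rng.normal(size=1_000_000)
params = np.zeros_like(grad)
idx, values = sparsify_gradient(grad, keep_ratio=0.01)
apply_sparse_update(params, idx, values)
print(len(idx))  # 10000 entries sent instead of 1000000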


Author(s):  
Ralf Schenkel ◽  
Stefanie Scherzinger ◽  
Marina Tropmann-Frick

Abstract: The special issue on "Data Engineering for Data Science" gives us occasion to assess the role of this topic in academic database teaching by means of a small survey. In this article, we report the collected results. We received 17 responses from the GI-Fachgruppe Datenbanksysteme. Compared to an earlier survey on teaching in the area of "Cloud", presented in the Datenbank-Spektrum in 2014, the results indicate that data engineering content is increasingly taught in undergraduate courses as well as outside of core computer science. Data engineering appears to be establishing itself as a cross-cutting topic that is no longer reserved for master's programs.


Author(s):  
Meike Klettke ◽  
Adrian Lutsch ◽  
Uta Störl

Abstract: Data engineering is an integral part of any data science and ML process. It consists of several subtasks that are performed to improve data quality and to transform data into a target format suitable for analysis. The quality and correctness of the data engineering steps are therefore important for the quality of the overall process. In machine learning processes, requirements such as fairness and explainability are essential, and they must also be met by the data engineering subtasks. In this article, we show how this can be achieved by logging, monitoring, and controlling data changes in order to evaluate their correctness. Since data preprocessing algorithms are part of every machine learning pipeline, they must also guarantee that they do not introduce data biases. We briefly introduce three classes of methods for measuring data changes in data engineering and discuss which research questions remain open in this area.
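
A minimal sketch of logging and monitoring data changes around one preprocessing step (column names, statistics, and the imputation step are illustrative assumptions, not one of the article's three method classes specifically): per-column profiles are recorded before and after the step so its effect on the data can be audited later.

import pandas as pd

def profile(df):
    """Collect simple per-column statistics used to monitor data changes."""
    return {
        col: {
            "rows": int(df[col].count()),
            "nulls": int(df[col].isna().sum()),
            "mean": float(df[col].mean()) if pd.api.types.is_numeric_dtype(df[col]) else None,
        }
        for col in df.columns
    }

def log_change(before, after, step_name):
    """Record how a preprocessing step changed the data, so the step can be audited."""
    report = {col: {"before": before[col], "after": after.get(col)} for col in before}
    print(f"[data-change] step={step_name}: {report}")

# Toy usage: an imputation step whose effect on the 'age' column is logged.
df = pd.DataFrame({"age": [25, None, 40, 35], "group": ["a", "b", "a", "b"]})
stats_before = profile(df)
df["age"] = df["age"].fillna(df["age"].mean())
log_change(stats_before, profile(df), step_name="impute_age")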


Author(s):  
Ralf Schenkel ◽  
Stefanie Scherzinger ◽  
Marina Tropmann-Frick ◽  
Theo Härder
