Distributed Query: Latest Research Papers

The enormous quantity of data produced every day together with advances in data analytics has led to a proliferation of data management and analysis systems. Typically, these systems are built around highly specialized monolithic operators optimized for the underlying hardware. While effective in the short term, such an approach makes the operators cumbersome to port and adapt, which is increasingly required due to the speed at which algorithms and hardware evolve. To address this limitation, we present Modularis, an execution layer for data analytics based on sub-operators, i.e., composable building blocks resembling traditional database operators but at a finer granularity. To demonstrate the feasibility and advantages of our approach, we use Modularis to build a distributed query processing system supporting relational queries running on an RDMA cluster, a serverless cloud platform, and a smart storage engine. Modularis requires minimal code changes to execute queries across these three diverse hardware platforms, showing that the sub-operator approach reduces the amount and complexity of the code to maintain. In fact, changes in the platform affect only those sub-operators that depend on the underlying hardware (in our use cases, mainly the sub-operators related to network communication). We show the end-to-end performance of Modularis by comparing it with a framework for SQL processing (Presto), a commercial cluster database (SingleStore), as well as Query-as-a-Service systems (Athena, BigQuery). Modularis outperforms all these systems, proving that the design and architectural advantages of a modular design can be achieved without degrading performance. We also compare Modularis with a hand-optimized implementation of a join for RDMA clusters. We show that Modularis has the advantage of being easily extensible to a wider range of join variants and group by queries, all of which are not supported in the hand-tuned join.
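
The sub-operator idea in this abstract can be illustrated with a minimal sketch. It is not Modularis code: the function names (`partition`, `local_exchange`, `build`, `probe`, `join`) are hypothetical, and the in-process `local_exchange` merely stands in for a platform-specific network sub-operator. The point it tries to show is that only the exchange step would need to change when the platform changes.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Sub-operator: split rows into n buckets by hash of the key column."""
    buckets = [[] for _ in range(n)]
    for row in rows:
        buckets[hash(row[key]) % n].append(row)
    return buckets


def local_exchange(buckets):
    """Hypothetical stand-in for the platform-dependent sub-operator.

    A real system would ship each bucket over the network here;
    this version keeps everything in one process.
    """
    return buckets


def build(rows, key):
    """Sub-operator: build a hash table on the key column."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table


def probe(table, rows, key):
    """Sub-operator: emit one merged row per matching pair."""
    return [{**match, **row} for row in rows for match in table.get(row[key], [])]


def join(left, right, key, n=2, exchange=local_exchange):
    """A partitioned hash join composed only of the sub-operators above."""
    left_parts = exchange(partition(left, key, n))
    right_parts = exchange(partition(right, key, n))
    out = []
    for l_part, r_part in zip(left_parts, right_parts):
        out.extend(probe(build(l_part, key), r_part, key))
    return out


customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "amount": 10}, {"id": 1, "amount": 5}, {"id": 3, "amount": 7}]
joined = join(customers, orders, "id")
```

Passing a different `exchange` function to `join` is the only change this sketch would need to target another platform; the remaining sub-operators are reused as they are.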

Database Principles and Challenges in Text Analysis

ACM SIGMOD Record ◽ 10.1145/3484622.3484624 ◽ 2021 ◽ Vol 50 (2) ◽ pp. 6-17

Author(s): Johannes Doleschal ◽ Benny Kimelfeld ◽ Wim Martens

Keyword(s): Text Analysis ◽ Relational Databases ◽ Query Evaluation ◽ Text Documents ◽ Step Process ◽ Distributed Query ◽ Recent Advances ◽ Aggregate Queries ◽ Technical Challenges ◽ Query Planning

A common conceptual view of text analysis is that of a two-step process, where we first extract relations from text documents and then apply a relational query over the result. Hence, text analysis shares technical challenges with, and can draw ideas from, relational databases. A framework that formally instantiates this connection is that of the document spanners. In this article, we review recent advances in various research efforts that adapt fundamental database concepts to text analysis through the lens of document spanners. Among others, we discuss aspects of query evaluation, aggregate queries, provenance, and distributed query planning.
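
The two-step view described above can be sketched as follows. This is an illustration under stated assumptions, not the formal document-spanner framework: a regular expression with named groups plays the role of a very simple spanner, and the documents, the pattern, and the helper names (`extract`, `span_text`, `employees_per_org`) are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical input documents, keyed by document id.
DOCS = {
    "d1": "Alice works at Acme. Bob works at Initech.",
    "d2": "Carol works at Acme.",
}

# Assumed extraction pattern: each named group is one variable of the spanner.
PATTERN = re.compile(r"(?P<person>[A-Z][a-z]+) works at (?P<org>[A-Z][a-z]+)")


def extract(doc_id, text):
    """Step 1: map a document to a relation whose values are (start, end) spans."""
    return [
        {"doc": doc_id, "person": m.span("person"), "org": m.span("org")}
        for m in PATTERN.finditer(text)
    ]


def span_text(doc_id, span):
    """Resolve a span back to the substring it covers."""
    start, end = span
    return DOCS[doc_id][start:end]


def employees_per_org(rows):
    """Step 2: a relational aggregate, equivalent to GROUP BY org with COUNT(*)."""
    return Counter(span_text(r["doc"], r["org"]) for r in rows)


rows = [row for doc_id, text in DOCS.items() for row in extract(doc_id, text)]
counts = employees_per_org(rows)
```

Keeping spans rather than strings in the extracted relation means each tuple still points back to its position in the source document.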

Compliant Geo-distributed Query Processing

Proceedings of the 2021 International Conference on Management of Data ◽ 10.1145/3448016.3453687 ◽ 2021

Author(s): Kaustubh Beedkar ◽ Jorge-Arnulfo Quiané-Ruiz ◽ Volker Markl

Keyword(s): Query Processing ◽ Distributed Query Processing ◽ Distributed Query

Distributed arrays: an algebra for generic distributed query processing

Distributed and Parallel Databases ◽ 10.1007/s10619-021-07325-2 ◽ 2021

Author(s): Ralf Hartmut Güting ◽ Thomas Behr ◽ Jan Kristof Nidzwetzki

Keyword(s): Query Processing ◽ Main Memory ◽ Basic Algebra ◽ Distributed Data ◽ Data Types ◽ Distributed Query Processing ◽ Data Set ◽ Distributed Query ◽ Basic Engine ◽ Distributed Arrays

We propose a simple model for distributed query processing based on the concept of a distributed array. Such an array has fields of some data type whose values can be stored on different machines. It offers operations to manipulate all fields in parallel within the distributed algebra. The arrays considered are one-dimensional and just serve to model a partitioned and distributed data set. Distributed arrays rest on a given set of data types and operations called the basic algebra implemented by some piece of software called the basic engine. It provides a complete environment for query processing on a single machine. We assume this environment is extensible by types and operations. Operations on distributed arrays are implemented by one basic engine called the master which controls a set of basic engines called the workers. It maps operations on distributed arrays to the respective operations on their fields executed by workers. The distributed algebra is completely generic: any type or operation added in the extensible basic engine will be immediately available for distributed query processing. To demonstrate the use of the distributed algebra as a language for distributed query processing, we describe a fairly complex algorithm for distributed density-based similarity clustering. The algorithm is a novel contribution by itself. Its complete implementation is shown in terms of the distributed algebra and the basic algebra. As a basic engine the Secondo system is used, a rich environment for extensible query processing, providing useful tools such as main memory M-trees, graphs, or a DBScan implementation.
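
The master/worker mapping described in this abstract can be sketched in a few lines. This is an illustration only and does not use the Secondo API: the class `DArray` and its methods `map_fields` and `collect` are hypothetical names, and a thread pool stands in for the set of worker engines.

```python
from concurrent.futures import ThreadPoolExecutor


class DArray:
    """One-dimensional array whose fields model the partitions of a data set."""

    def __init__(self, fields, workers=4):
        self.fields = list(fields)
        self.workers = workers

    def map_fields(self, op):
        """Master role: run a single-machine operation on every field in parallel."""
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # pool.map preserves field order; the list is materialized
            # inside the constructor before the pool shuts down.
            return DArray(pool.map(op, self.fields), self.workers)

    def collect(self, combine, initial):
        """Master role: fold the per-field results into one value."""
        result = initial
        for field in self.fields:
            result = combine(result, field)
        return result


# Three partitions of a data set, as they might be placed on three workers.
partitions = DArray([[1, 2, 3], [4, 5], [6]])
partial_sums = partitions.map_fields(sum)
total = partial_sums.collect(lambda a, b: a + b, 0)
```

Because `map_fields` accepts any callable, an operation written for a single partition becomes usable on the whole array without further changes, which mirrors the genericity claim in the abstract.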

Reasoning on the Efficiency of Distributed Complex Event Processing

Fundamenta Informaticae ◽ 10.3233/fi-2021-2017 ◽ 2021 ◽ Vol 179 (2) ◽ pp. 113-134

Author(s): Samira Akili ◽ Matthias Weidlich

Keyword(s): High Frequency ◽ Complex Event Processing ◽ Data Sources ◽ Event Processing ◽ Query Evaluation ◽ Query Execution ◽ Event Data ◽ Distributed Query ◽ Geographically Distributed ◽ Network Costs

Complex event processing (CEP) evaluates queries over streams of event data to detect situations of interest. If the event data are produced by geographically distributed sources, CEP may exploit in-network processing that distributes the evaluation of a query among the nodes of a network. To this end, a query is modularized and individual query operators are assigned to nodes, especially those that act as data sources. Existing solutions for such operator placement, however, are limited in that they assume all query results to be gathered at one designated node, commonly referred to as a sink. Hence, existing techniques postulate a hierarchical structure of the network that generates and processes the event data. This largely neglects the optimisation potential that stems from truly decentralised query evaluation with potentially many sinks. To address this gap, in this paper, we propose Multi-Sink Evaluation (MuSE) graphs as a formal computational model to evaluate common CEP queries in a decentralised manner. We further prove the completeness of query evaluation under this model. Striving for distributed CEP that can scale to large volumes of high-frequency event streams, we show how to reason on the network costs induced by distributed query evaluation and prune inefficient query execution plans. As such, our work lays the foundation for distributed CEP that is both sound and efficient.
