Join Algorithms: Latest Research Papers

MapReduce-Based Dynamic Partition Join with Shannon Entropy for Data Skewness

Scientific Programming ◽

10.1155/2021/1602767 ◽

2021 ◽

Vol 2021 ◽

pp. 1-15

Author(s):

Donghua Chen ◽

Runtong Zhang

Keyword(s):

Shannon Entropy ◽

Real Life ◽

Massive Data ◽

Data Sets ◽

Factors Affecting ◽

Join Algorithm ◽

Proper Design ◽

Lower Entropy ◽

Join Algorithms ◽

Measure Entropy

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. DPJ uses the changes of entropy in the partitions of data sets during the Map and Reduce stages to revise the logical partitions, changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can capture the entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.
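
As a rough illustration of the quantity involved, the sketch below applies the Shannon entropy formula to the record counts that a partitioner assigns to reduce tasks. It is not the paper's DPJ algorithm or its entropy measures, which the abstract does not spell out; the function names, the modulo partitioner, and the sample keys are assumptions made for this example.

```python
import math
from collections import Counter


def shannon_entropy(counts):
    """Shannon entropy, in bits, of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy


def partition_entropy(keys, num_partitions, partitioner=hash):
    """Entropy of the record counts a (hypothetical) partitioner sends to each reducer."""
    sizes = Counter(partitioner(k) % num_partitions for k in keys)
    return shannon_entropy([sizes.get(i, 0) for i in range(num_partitions)])


uniform_keys = list(range(8))   # every join key appears once
skewed_keys = [0] * 7 + [1]     # one hot key dominates the input

print(partition_entropy(uniform_keys, 4))  # even split over 4 reducers
print(partition_entropy(skewed_keys, 4))   # most records land on one reducer
```

In this sketch an even split over n reducers reaches the maximum of log2(n) bits, and skew pulls the value down. How the paper's own measures relate to this simple load-based reading is not stated in the abstract.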

Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters

Journal of Information and Data Management ◽

10.5753/jidm.2021.1969 ◽

2021 ◽

Vol 12 (3) ◽

Author(s):

Leonardo Andrade Ribeiro ◽

Felipe Ferreira Borges ◽

Diego Oliveira

Keyword(s):

Data Structure ◽

Processing Time ◽

Cost Model ◽

Similarity Join ◽

Attribute Data ◽

Join Algorithms ◽

Filtering Technique ◽

Alternative Approaches ◽

Similarity Joins ◽

Single Set

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
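
The idea of using predicates over other attributes to avoid similarity computations can be sketched as follows. This is a naive quadratic self-join, not the framework or data structure from the article; the record layout, the `same_year` predicate, and the threshold are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(records, threshold, attr_filter=None):
    """Self-join returning id pairs whose token sets have Jaccard >= threshold.

    records: list of (record_id, token_set, attributes) tuples.
    attr_filter: optional cheap predicate over two attribute dicts; pairs
    that fail it are skipped before any set similarity is computed.
    Returns (matching pairs, number of Jaccard computations performed).
    """
    results = []
    verified = 0
    for (id1, s1, a1), (id2, s2, a2) in combinations(records, 2):
        if attr_filter is not None and not attr_filter(a1, a2):
            continue
        # Size filter: Jaccard >= t implies min size >= t * max size.
        small, large = sorted((len(s1), len(s2)))
        if small < threshold * large:
            continue
        verified += 1
        if jaccard(s1, s2) >= threshold:
            results.append((id1, id2))
    return results, verified


def same_year(a1, a2):
    """Hypothetical extra predicate on a second attribute."""
    return a1["year"] == a2["year"]


records = [
    (1, {"efficient", "set", "join"}, {"year": 2021}),
    (2, {"efficient", "set", "joins"}, {"year": 2021}),
    (3, {"efficient", "set", "join"}, {"year": 2019}),
    (4, {"graph"}, {"year": 2021}),
]

print(similarity_join(records, 0.5))             # set predicate only
print(similarity_join(records, 0.5, same_year))  # with the attribute filter
```

With the attribute predicate in place, only one candidate pair reaches the Jaccard computation in this toy input, compared with three without it.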

SkinnerDB: Regret-bounded Query Evaluation via Reinforcement Learning

ACM Transactions on Database Systems ◽

10.1145/3464389 ◽

2021 ◽

Vol 46 (3) ◽

pp. 1-45

Author(s):

Immanuel Trummer ◽

Junxiong Wang ◽

Ziyun Wei ◽

Deepak Maram ◽

Samuel Moseley ◽

...

Keyword(s):

Reinforcement Learning ◽

Quality Criterion ◽

Database Systems ◽

Small Time ◽

Adaptive Processing ◽

Learning To Learn ◽

Performance Impact ◽

Query Result ◽

Join Ordering ◽

Join Algorithms

SkinnerDB uses reinforcement learning for reliable join ordering, exploiting an adaptive processing engine with specialized join algorithms and data structures. It maintains no data statistics and uses no cost or cardinality models. Also, it uses no training workloads nor does it try to link the current query to seemingly similar queries in the past. Instead, it uses reinforcement learning to learn optimal join orders from scratch during the execution of the current query. To that purpose, it divides the execution of a query into many small time slices. Different join orders are tried in different time slices. SkinnerDB merges result tuples generated according to different join orders until a complete query result is obtained. By measuring execution progress per time slice, it identifies promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We upper-bound expected execution cost regret, i.e., the expected amount of execution cost wasted due to sub-optimal join order choices. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark, TPC-H, and JCC-H, as well as benchmark variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.
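
The time-slicing idea can be illustrated with a small bandit loop. The abstract does not name the learning algorithm, so this sketch swaps in the simple UCB1 rule over a fixed list of candidate join orders; it does not merge partial results and is not SkinnerDB's engine. The `progress_per_slice` callback and the reward values are hypothetical.

```python
import math


def ucb1_join_order_selection(join_orders, progress_per_slice, num_slices):
    """Choose one join order per time slice using the UCB1 bandit rule.

    progress_per_slice(order) is a hypothetical callback that returns a
    reward in [0, 1]: the execution progress made in one slice with that order.
    Returns how many slices each join order received.
    """
    counts = {order: 0 for order in join_orders}
    totals = {order: 0.0 for order in join_orders}

    for t in range(1, num_slices + 1):
        untried = [order for order in join_orders if counts[order] == 0]
        if untried:
            # Try every join order once before comparing them.
            choice = untried[0]
        else:
            def upper_bound(order):
                mean = totals[order] / counts[order]
                bonus = math.sqrt(2.0 * math.log(t) / counts[order])
                return mean + bonus

            choice = max(join_orders, key=upper_bound)
        reward = progress_per_slice(choice)
        counts[choice] += 1
        totals[choice] += reward
    return counts


# Made-up, deterministic progress values standing in for measured progress.
simulated_progress = {"R-S-T": 0.9, "S-T-R": 0.2, "T-R-S": 0.1}
slices = ucb1_join_order_selection(
    list(simulated_progress), simulated_progress.get, 200
)
print(slices)
```

The exploration bonus shrinks as an order accumulates slices, so poorly performing orders are still tried occasionally but receive only a small share of the total execution time.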

A Theoretical and Experimental Comparison of Large-Scale Join Algorithms in Spark

SN Computer Science ◽

10.1007/s42979-021-00738-x ◽

2021 ◽

Vol 2 (5) ◽

Author(s):

Anh-Cang Phan ◽

Thuong-Cang Phan ◽

Thanh-Ngoan Trieu ◽

Thi-To-Quyen Tran

Keyword(s):

Large Scale ◽

Experimental Comparison ◽

Join Algorithms

In-Memory Interval Joins

The VLDB Journal ◽

10.1007/s00778-020-00639-0 ◽

2021 ◽

Author(s):

Panagiotis Bouros ◽

Nikos Mamoulis ◽

Dimitrios Tsitsigkos ◽

Manolis Terrovitis

Keyword(s):

Parallel Computation ◽

State Of The Art ◽

Complex Data ◽

Plane Sweep ◽

Join Algorithm ◽

Sweep Algorithm ◽

Join Algorithms ◽

Domain Partitioning ◽

Complex Data Structure ◽

Independent Tasks

The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk, so their focus is on minimizing I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with, or even faster than, the state of the art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins to be scheduled on the available CPU threads and propose an adaptive domain partitioning, aiming at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
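
A basic forward scan join can be sketched as below, assuming closed intervals given as (start, end) tuples. This is the plain textbook form without the four optimizations or the parallel variants mentioned in the abstract; the function names and sample data are made up for the example.

```python
def forward_scan_join(r_intervals, s_intervals):
    """Return all pairs (r, s) of closed intervals [start, end] that overlap.

    Both inputs are sorted by start point. The interval with the smaller
    start is joined by scanning forward in the other input for as long as
    the scanned intervals start no later than it ends.
    """
    R = sorted(r_intervals)
    S = sorted(s_intervals)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            r = R[i]
            k = j
            while k < len(S) and S[k][0] <= r[1]:
                out.append((r, S[k]))
                k += 1
            i += 1
        else:
            s = S[j]
            k = i
            while k < len(R) and R[k][0] <= s[1]:
                out.append((R[k], s))
                k += 1
            j += 1
    return out


def nested_loop_join(r_intervals, s_intervals):
    """Quadratic reference join, used only to check the sketch."""
    return [
        (r, s)
        for r in r_intervals
        for s in s_intervals
        if r[0] <= s[1] and s[0] <= r[1]
    ]


R = [(1, 5), (3, 4), (8, 10)]
S = [(2, 3), (5, 8), (11, 12)]
print(sorted(forward_scan_join(R, S)))
```

Each result pair is produced exactly once, when the member with the smaller start point is processed, so no duplicate elimination step is needed.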

Optimizations for filter-based join algorithms in MapReduce

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201220 ◽

2021 ◽

pp. 1-18

Author(s):

Salahaldeen Rababa ◽

Amer Al-Badarneh

Keyword(s):

Cost Analysis ◽

Execution Time ◽

Large Scale ◽

Programming Model ◽

State Of The Art ◽

Total Execution Time ◽

Large Scale Data ◽

Heterogeneous Datasets ◽

Join Algorithms ◽

Scale Data

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets because of the large number of redundant intermediate records that are transferred through the network. Several filtering techniques have been developed to improve join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced that perform filter creation and redundant-record elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively, and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.
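
The general principle behind filter-based joins can be sketched in a single process: build a Bloom filter over the join keys of the smaller input and use it to discard records of the larger input that cannot match. This is not the paper's adaptive single-job MapReduce design; the class, the filter sizing, and the sample data are assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; stores one byte per bit for readability."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_join(small, large):
    """Equi-join two lists of (key, value) pairs, pre-filtering the larger one.

    Returns (joined rows, number of large-side records that passed the filter).
    """
    bloom = BloomFilter()
    index = {}
    for key, value in small:
        bloom.add(key)
        index.setdefault(key, []).append(value)
    # In a distributed setting, only these survivors would cross the network.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]
    joined = [(k, sv, lv) for k, lv in survivors for sv in index.get(k, [])]
    return joined, len(survivors)


small = [(1, "a"), (2, "b")]
large = [(i, f"x{i}") for i in range(100)]
rows, passed = bloom_join(small, large)
print(rows, passed)
```

False positives only let a few extra records through; the final lookup against the real keys keeps the join result exact.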

Cache-efficient sweeping-based interval joins for extended Allen relation predicates

The VLDB Journal ◽

10.1007/s00778-020-00650-5 ◽

2021 ◽

Author(s):

Danila Piatov ◽

Sven Helmer ◽

Anton Dignös ◽

Fabio Persia

Keyword(s):

Data Structure ◽

Experimental Evaluation ◽

State Of The Art ◽

Temporal Databases ◽

Access Method ◽

Wide Range ◽

Interval Relation ◽

Cache Efficient ◽

Join Algorithms ◽

Better Than

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
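
For readers unfamiliar with the predicates involved, the sketch below classifies a pair of intervals into one of the thirteen basic Allen relations. It only shows the predicates themselves, not the sweeping framework, the Timeline Index, or the gapless hash map from the article; the function name and the (start, end) tuple layout with start < end are assumptions.

```python
def allen_relation(a, b):
    """Name the Allen relation that holds from interval a to interval b.

    Intervals are (start, end) tuples with start < end.
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining cases: the intervals partially overlap.
    return "overlaps" if a_start < b_start else "overlapped-by"


def relation_join(r_intervals, s_intervals, relation):
    """Naive quadratic join that keeps pairs satisfying one Allen relation."""
    return [
        (r, s)
        for r in r_intervals
        for s in s_intervals
        if allen_relation(r, s) == relation
    ]


print(allen_relation((1, 5), (3, 8)))
print(relation_join([(1, 4), (3, 5)], [(1, 8), (4, 6)], "meets"))
```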

Massively Parallel Join Algorithms

ACM SIGMOD Record ◽

10.1145/3444831.3444833 ◽

2020 ◽

Vol 49 (3) ◽

pp. 6-17

Author(s):

Xiao Hu ◽

Ke Yi

Keyword(s):

Massively Parallel ◽

Join Algorithms

Efficient join algorithms for large database tables in a multi-GPU environment

Proceedings of the VLDB Endowment ◽

10.14778/3436905.3436927 ◽

2020 ◽

Vol 14 (4) ◽

pp. 708-720

Author(s):

Ran Rui ◽

Hao Li ◽

Yi-Cheng Tu

Keyword(s):

Data Transfer ◽

General Purpose ◽

Management Systems ◽

Large Database ◽

Multiple Gpus ◽

Computing Platform ◽

Significant Performance ◽

Join Algorithms ◽

High Scalability ◽

Nested Loop

Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs, as a general-purpose parallel computing platform, are very promising for processing relational joins. However, join algorithms often need to handle very large input data, an issue that was not sufficiently addressed in existing work. Moreover, as more and more desktop and workstation platforms support multi-GPU environments, the combined computing capability of multiple GPUs can easily match that of a computing cluster. It is worth exploring how join processing would benefit from the use of multiple GPUs. We identify the low rate and complex patterns of data transfer among the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome such challenges, we propose three distinctive designs of multi-GPU join algorithms, namely, the nested loop, global sort-merge, and hybrid joins for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate the high scalability of our algorithms over data size and the significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance than existing join algorithms, with a speedup of up to 25X and 2.8X over the best known code developed for multi-core CPUs and GPUs, respectively.
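
As background for the sort-merge design named above, here is a single-threaded CPU sketch of a sort-merge equi-join. It says nothing about the paper's multi-GPU data transfer or partitioning schemes; the function name and the (key, value) row layout are assumptions for illustration.

```python
def sort_merge_join(left, right):
    """Equi-join two lists of (key, value) rows by sorting and merging.

    Returns (key, left_value, right_value) tuples for every matching pair.
    """
    L = sorted(left, key=lambda row: row[0])
    R = sorted(right, key=lambda row: row[0])
    out = []
    i = j = 0
    while i < len(L) and j < len(R):
        if L[i][0] < R[j][0]:
            i += 1
        elif L[i][0] > R[j][0]:
            j += 1
        else:
            key = L[i][0]
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(L) and L[i_end][0] == key:
                i_end += 1
            j_end = j
            while j_end < len(R) and R[j_end][0] == key:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, left_value in L[i:i_end]:
                for _, right_value in R[j:j_end]:
                    out.append((key, left_value, right_value))
            i, j = i_end, j_end
    return out


left = [(2, "b"), (1, "a"), (2, "c")]
right = [(2, "x"), (3, "y"), (2, "z")]
print(sort_merge_join(left, right))
```

After sorting, each input is read once in key order, which is the sequential access pattern that makes the approach attractive when tables are much larger than fast memory.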

Optimal Join Algorithms Meet Top-k

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data ◽

10.1145/3318464.3383132 ◽

2020 ◽

Author(s):

Nikolaos Tziavelis ◽

Wolfgang Gatterbauer ◽

Mirek Riedewald

Keyword(s):

Join Algorithms
