join algorithms
Recently Published Documents


TOTAL DOCUMENTS

125
(FIVE YEARS 20)

H-INDEX

16
(FIVE YEARS 1)

2021 ◽  
Vol 2021 ◽  
pp. 1-15
Author(s):  
Donghua Chen ◽  
Runtong Zhang

Join operations of data sets play a crucial role in obtaining the relations of massive data in real life. Joining two data sets with MapReduce requires a proper design of the Map and Reduce stages for different scenarios. The factors affecting MapReduce join efficiency include the density of the data sets and data transmission over clusters like Hadoop. This study aims to improve the efficiency of MapReduce join algorithms on Hadoop by leveraging Shannon entropy to measure the information changes of data sets being joined in different MapReduce stages. To reduce the uncertainty of data sets in joins through the network, a novel MapReduce join algorithm with dynamic partition strategies called dynamic partition join (DPJ) is proposed. Leveraging the changes of entropy in the partitions of data sets during the Map and Reduce stages revises the logical partitions by changing the original input of the reduce tasks in the MapReduce jobs. Experimental results indicate that the entropy-based measures can measure entropy changes of join operations. Moreover, the DPJ variant methods achieved lower entropy compared with the existing joins, thereby increasing the feasibility of MapReduce join operations for different scenarios on Hadoop.


2021 ◽  
Vol 12 (3) ◽  
Author(s):  
Leonardo Andrade Ribeiro ◽  
Felipe Ferreira Borges ◽  
Diego Oliveira

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.


2021 ◽  
Vol 46 (3) ◽  
pp. 1-45
Author(s):  
Immanuel Trummer ◽  
Junxiong Wang ◽  
Ziyun Wei ◽  
Deepak Maram ◽  
Samuel Moseley ◽  
...  

SkinnerDB uses reinforcement learning for reliable join ordering, exploiting an adaptive processing engine with specialized join algorithms and data structures. It maintains no data statistics and uses no cost or cardinality models. Also, it uses no training workloads nor does it try to link the current query to seemingly similar queries in the past. Instead, it uses reinforcement learning to learn optimal join orders from scratch during the execution of the current query. To that purpose, it divides the execution of a query into many small time slices. Different join orders are tried in different time slices. SkinnerDB merges result tuples generated according to different join orders until a complete query result is obtained. By measuring execution progress per time slice, it identifies promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We upper-bound expected execution cost regret, i.e., the expected amount of execution cost wasted due to sub-optimal join order choices. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark, TPC-H, and JCC-H, as well as benchmark variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.


2021 ◽  
Vol 2 (5) ◽  
Author(s):  
Anh-Cang Phan ◽  
Thuong-Cang Phan ◽  
Thanh-Ngoan Trieu ◽  
Thi-To-Quyen Tran

2021 ◽  
Author(s):  
Panagiotis Bouros ◽  
Nikos Mamoulis ◽  
Dimitrios Tsitsigkos ◽  
Manolis Terrovitis

AbstractThe interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk and so, their focus is to minimize the I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm, for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive or even faster than the state-of-the-art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins to be scheduled in the available CPU threads and propose an adaptive domain partitioning, aiming at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.


2021 ◽  
pp. 1-18
Author(s):  
Salahaldeen Rababa ◽  
Amer Al-Badarneh

Large-scale datasets collected from heterogeneous sources often require a join operation to extract valuable information. MapReduce is an efficient programming model for processing large-scale data. However, it has some limitations in processing heterogeneous datasets. This is because of the large amount of redundant intermediate records that are transferred through the network. Several filtering techniques have been developed to improve the join performance, but they require multiple MapReduce jobs to process the input datasets. To address this issue, the adaptive filter-based join algorithms are presented in this paper. Specifically, three join algorithms are introduced to perform the processes of filters creation and redundant records elimination within a single MapReduce job. A cost analysis of the introduced join algorithms shows that the I/O cost is reduced compared to the state-of-the-art filter-based join algorithms. The performance of the join algorithms was evaluated in terms of the total execution time and the total amount of I/O data transferred. The experimental results show that the adaptive Bloom join, semi-adaptive intersection Bloom join, and adaptive intersection Bloom join decrease the total execution time by 30%, 25%, and 35%, respectively; and reduce the total amount of I/O data transferred by 18%, 25%, and 50%, respectively.


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

AbstractWe develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.


2020 ◽  
Vol 49 (3) ◽  
pp. 6-17
Author(s):  
Xiao Hu ◽  
Ke Yi

2020 ◽  
Vol 14 (4) ◽  
pp. 708-720
Author(s):  
Ran Rui ◽  
Hao Li ◽  
Yi-Cheng Tu

Relational join processing is one of the core functionalities in database management systems. It has been demonstrated that GPUs as a general-purpose parallel computing platform is very promising in processing relational joins. However, join algorithms often need to handle very large input data, which is an issue that was not sufficiently addressed in existing work. Besides, as more and more desktop and workstation platforms support multi-GPU environment, the combined computing capability of multiple GPUs can easily achieve that of a computing cluster. It is worth exploring how join processing would benefit from the adaptation of multiple GPUs. We identify the low rate and complex patterns of data transfer among the CPU and GPUs as the main challenges in designing efficient algorithms for large table joins. To overcome such challenges, we propose three distinctive designs of multi-GPU join algorithms, namely, the nested loop, global sort-merge and hybrid joins for large table joins with different join conditions. Extensive experiments running on multiple databases and two different hardware configurations demonstrate high scalability of our algorithms over data size and significant performance boost brought by the use of multiple GPUs. Furthermore, our algorithms achieve much better performance as compared to existing join algorithms, with a speedup up to 25X and 2.8X over best known code developed for multi-core CPUs and GPUs respectively.


Author(s):  
Nikolaos Tziavelis ◽  
Wolfgang Gatterbauer ◽  
Mirek Riedewald
Keyword(s):  

Sign in / Sign up

Export Citation Format

Share Document