Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters

2021 ◽  
Vol 12 (3) ◽  
Author(s):  
Leonardo Andrade Ribeiro ◽  
Felipe Ferreira Borges ◽  
Diego Oliveira

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
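As a rough illustration of the idea of placing cheap attribute filters in front of the similarity computation, the following minimal sketch joins records that carry one token set plus further attributes. It is not the authors' framework: the nested-loop candidate generation, the record layout (a `tokens` set plus plain attributes), the use of Jaccard similarity, and the equality predicates on the extra attributes are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def multi_attribute_join(records, threshold, filter_attrs):
    """Self-join sketch over a list of dicts.

    Each record is assumed to hold a 'tokens' set plus further attributes.
    A pair qualifies if it agrees on every attribute in filter_attrs
    (an assumed equality predicate) and its token sets reach the threshold.
    Returns the matching index pairs and the number of Jaccard computations.
    """
    results = []
    verified = 0
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        # Cheap attribute checks run first and prune the pair early.
        if any(r[attr] != s[attr] for attr in filter_attrs):
            continue
        verified += 1  # only surviving pairs pay for the set similarity
        if jaccard(r["tokens"], s["tokens"]) >= threshold:
            results.append((i, j))
    return results, verified
```

Calling the function once with an empty `filter_attrs` list and once with the attribute predicates shows how many similarity computations the filters save on a given dataset.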

2020 ◽  
Author(s):  
Leonardo Andrade Ribeiro ◽  
Felipe Ferreira Borges ◽  
Diego Junior do Carmo Oliveira

Set similarity join, which finds all pairs of similar sets in a collection, plays an important role in data cleaning and integration. Many algorithms have been proposed to efficiently answer set similarity join on single-attribute data. However, real-world data often contain multiple attributes. In this paper, we propose a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then present a simple, yet effective filter based on lightweight indexes, for which exact and probabilistic implementation alternatives are evaluated. Finally, we devise a cost model to identify the best attribute ordering to reduce processing time. Our experimental results show that our approach is effective and significantly outperforms previous work.
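To make the attribute-ordering question concrete, here is a small sketch of one possible cost formulation. It is not the cost model from the paper: it assumes each attribute filter has a fixed per-pair cost and an independent pass rate (the fraction of pairs that survive it), and it simply enumerates all orderings, which is only practical for a handful of attributes. The names `cost` and `pass_rate` are hypothetical inputs.

```python
from itertools import permutations


def expected_cost(order, cost, pass_rate):
    """Expected per-pair cost of applying the filters in the given order.

    A filter is only evaluated on pairs that survived all earlier filters,
    so its cost is weighted by the product of the earlier pass rates
    (assuming the filters are independent).
    """
    total = 0.0
    surviving = 1.0
    for attr in order:
        total += surviving * cost[attr]
        surviving *= pass_rate[attr]
    return total


def best_order(cost, pass_rate):
    """Brute-force search for the cheapest attribute ordering."""
    return min(permutations(cost),
               key=lambda order: expected_cost(order, cost, pass_rate))
```

Under these assumptions, cheap and highly selective filters tend to move to the front, because they shrink the set of pairs that the more expensive filters have to examine.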


2021 ◽  
Author(s):  
Danila Piatov ◽  
Sven Helmer ◽  
Anton Dignös ◽  
Fabio Persia

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
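For readers unfamiliar with plane sweeping, the sketch below shows a generic endpoint-based sweep for a single predicate, the overlap of closed intervals. It is a textbook-style illustration rather than the algorithms of the paper: the Timeline Index and the gapless hash map are not modeled, ordinary Python sets stand in for the active-tuple structure, and the function name `overlap_join` is made up for this example.

```python
def overlap_join(r, s):
    """Return index pairs (i, j) such that r[i] and s[j] overlap.

    Intervals are (start, end) tuples with start <= end, treated as closed.
    """
    events = []
    for side, rel in (("r", r), ("s", s)):
        for idx, (start, end) in enumerate(rel):
            # Kind 0 (start) sorts before kind 1 (end) at equal timestamps,
            # so intervals that merely touch are still reported.
            events.append((start, 0, side, idx))
            events.append((end, 1, side, idx))
    events.sort()

    active = {"r": set(), "s": set()}  # intervals currently crossed by the sweep line
    out = []
    for _, kind, side, idx in events:
        if kind == 0:
            other = "s" if side == "r" else "r"
            # Every interval of the other relation that is still open overlaps.
            for o in active[other]:
                out.append((idx, o) if side == "r" else (o, idx))
            active[side].add(idx)
        else:
            active[side].discard(idx)
    return out
```

Other interval predicates would need different handling of the start and end events; this sketch covers overlap only.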


Author(s):  
Leonardo Andrade Ribeiro ◽  
Alfredo Cuzzocrea ◽  
Karen Aline Alves Bezerra ◽  
Ben Hur Bahia do Nascimento

2019 ◽  
Vol 18 (04) ◽  
pp. 1289-1315
Author(s):  
M. Asif Naeem

Online stream processing is an emerging research area in the field of computer science. Semi-stream processing is a particular type of stream processing where a stream of data is processed with a disk-based relation. A semi-stream join operator is required to implement this operation. Many semi-stream joins use a queue of stream tuples to amortize access cost for the disk-based relation, and use an index to allow directed access to the relation, avoiding the loading of unnecessary partitions of the relation. In such a situation, the question arises which partitions of the relation should be accessed, as any stream tuple from the queue could serve as a lookup element for accessing the relation index. Existing algorithms use simple, safe, and correct strategies, but are not optimal in the sense of maximizing the join service rate. This paper makes two contributions. The first is an optimization: we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data, and show that a good selection strategy can significantly improve the service rate of existing join algorithms. The second is an extension: we develop a multi-stage join for semi-stream join algorithms. A multi-stage join is important when stream data needs to be joined with two or more tables in the relation, e.g., when a stream of sales data needs information added from the product and customer tables. To the best of our knowledge, none of the existing algorithms implements this feature. For the service rate evaluation, we use two well-performing existing algorithms, CACHEJOIN and HYBRIDJOIN. We evaluate the service rate using real, TPC-H, and synthetic datasets with a known skewed distribution. We also present the cost model for our multi-stage join.
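The role of the queue and of the lookup element can be illustrated with a toy in-memory model. This is neither CACHEJOIN nor HYBRIDJOIN: the partition layout (`partitions` as a dict of dicts), the `partition_of` function, and the two selection strategies ("oldest" and "frequent") are assumptions chosen for the example, and a dictionary access stands in for loading a partition from disk.

```python
from collections import Counter, deque
from itertools import islice


def semi_stream_join(stream, partitions, partition_of, queue_size, pick="oldest"):
    """Join (key, payload) stream tuples with a partitioned relation.

    partitions maps a partition id to a dict {key: row}; partition_of maps a
    key to its partition id. Returns the join output and the number of
    partition loads, a stand-in for disk accesses.
    """
    it = iter(stream)
    queue = deque()
    out = []
    loads = 0
    while True:
        # Refill the queue from the stream up to its capacity.
        queue.extend(islice(it, queue_size - len(queue)))
        if not queue:
            break
        # Choose the lookup element whose partition will be loaded.
        if pick == "oldest":
            key = queue[0][0]
        else:  # "frequent": the most common key in the queue, useful under skew
            key = Counter(k for k, _ in queue).most_common(1)[0][0]
        pid = partition_of(key)
        part = partitions.get(pid, {})
        loads += 1
        # One load serves every queued tuple that falls into this partition.
        remaining = deque()
        for k, payload in queue:
            if partition_of(k) == pid:
                if k in part:
                    out.append((k, payload, part[k]))
            else:
                remaining.append((k, payload))
        queue = remaining
    return out, loads
```

Comparing the returned `loads` for different values of `queue_size` and `pick` on the same stream gives a feel for how queueing and lookup selection affect the number of accesses in this simplified setting.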


