Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.


A Framework for Set Similarity Join on Multi-Attribute Data

10.5753/sbbd.2020.13625 ◽

2020 ◽

Author(s):

Leonardo Andrade Ribeiro ◽

Felipe Ferreira Borges ◽

Diego Junior do Carmo Oliveira

Keyword(s):

Real World ◽

Processing Time ◽

Data Cleaning ◽

Cost Model ◽

Experimental Results ◽

Real World Data ◽

Similarity Join ◽

World Data ◽

Attribute Data ◽

Single Attribute

Set similarity join, which finds all pairs of similar sets in a collection, plays an important role in data cleaning and integration. Many algorithms have been proposed to efficiently answer set similarity join on single-attribute data. However, real-world data often contain multiple attributes. In this paper, we propose a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then present a simple, yet effective filter based on lightweight indexes, for which exact and probabilistic implementation alternatives are evaluated. Finally, we devise a cost model to identify the best attribute ordering to reduce processing time. Our experimental results show that our approach is effective and significantly outperforms previous work.
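As a rough illustration of the problem setting in this abstract, the sketch below runs a naive set similarity join in which a cheap predicate on a second attribute is checked before the set comparison. It is not the paper's index-based filter or its cost model. The Jaccard measure, the `year` attribute, and the names `jaccard` and `multi_attribute_join` are assumptions made only for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def multi_attribute_join(records, set_threshold, max_year_gap):
    """Return index pairs (i, j), i < j, that satisfy both predicates.

    records: list of (token_set, year) tuples. 'year' is a hypothetical
    second attribute used only for illustration.
    """
    result = []
    for (i, (tokens_i, year_i)), (j, (tokens_j, year_j)) in combinations(
        enumerate(records), 2
    ):
        # The cheap attribute predicate runs first, so the more expensive
        # set comparison is skipped for pairs that cannot qualify.
        if abs(year_i - year_j) > max_year_gap:
            continue
        if jaccard(tokens_i, tokens_j) >= set_threshold:
            result.append((i, j))
    return result
```

The nested-loop enumeration is quadratic and is only meant to show where an additional attribute filter can prune candidate pairs; practical algorithms generate candidates through indexes instead.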


Cache-efficient sweeping-based interval joins for extended Allen relation predicates

The VLDB Journal ◽

10.1007/s00778-020-00650-5 ◽

2021 ◽

Author(s):

Danila Piatov ◽

Sven Helmer ◽

Anton Dignös ◽

Fabio Persia

Keyword(s):

Data Structure ◽

Experimental Evaluation ◽

State Of The Art ◽

Temporal Databases ◽

Access Method ◽

Wide Range ◽

Interval Relation ◽

Cache Efficient ◽

Join Algorithms ◽

Better Than

We develop a family of efficient plane-sweeping interval join algorithms for evaluating a wide range of interval predicates such as Allen’s relationships and parameterized relationships. Our technique is based on a framework, components of which can be flexibly combined in different manners to support the required interval relation. In temporal databases, our algorithms can exploit a well-known and flexible access method, the Timeline Index, thus expanding the set of operations it supports even further. Additionally, employing a compact data structure, the gapless hash map, we utilize the CPU cache efficiently. In an experimental evaluation, we show that our approach is several times faster and scales better than state-of-the-art techniques, while being much better suited for real-time event processing.
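To make the plane-sweeping idea concrete, here is a minimal sketch of a sweep-based join for a single predicate, interval overlap on closed intervals. It is an assumption-laden illustration and not the authors' framework: it uses ordinary Python sets for the active intervals rather than the gapless hash map, does not use the Timeline Index, and does not cover the other Allen relations. The function name `sweep_overlap_join` is hypothetical.

```python
def sweep_overlap_join(r, s):
    """Return sorted pairs (i, j) where closed intervals r[i] and s[j] overlap.

    r, s: lists of (start, end) tuples with start <= end.
    """
    # Event tuple: (time, kind, relation, index). kind 0 = start, 1 = end,
    # so at equal times starts sort before ends and touching intervals
    # are reported as overlapping.
    events = []
    for i, (start, end) in enumerate(r):
        events.append((start, 0, 0, i))
        events.append((end, 1, 0, i))
    for j, (start, end) in enumerate(s):
        events.append((start, 0, 1, j))
        events.append((end, 1, 1, j))
    events.sort()

    active_r, active_s = set(), set()
    result = []
    for _, kind, rel, idx in events:
        if kind == 0:
            # A starting interval overlaps every active interval of the
            # other relation; each pair is emitted once, at the later start.
            if rel == 0:
                result.extend((idx, j) for j in active_s)
                active_r.add(idx)
            else:
                result.extend((i, idx) for i in active_r)
                active_s.add(idx)
        else:
            (active_r if rel == 0 else active_s).discard(idx)
    return sorted(result)
```

For example, `sweep_overlap_join([(1, 5), (6, 8)], [(4, 6), (9, 10)])` pairs both intervals of the first list with `(4, 6)` and nothing with `(9, 10)`.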


Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework

Lecture Notes in Computer Science - Database and Expert Systems Applications ◽

10.1007/978-3-319-44403-1_12 ◽

2016 ◽

pp. 185-204 ◽

Cited By ~ 3

Author(s):

Leonardo Andrade Ribeiro ◽

Alfredo Cuzzocrea ◽

Karen Aline Alves Bezerra ◽

Ben Hur Bahia do Nascimento

Keyword(s):

Similarity Join ◽

Join Algorithms


A Multi-attribute Data Structure with Parallel Bloom Filters for Network Services

High Performance Computing - HiPC 2006 - Lecture Notes in Computer Science ◽

10.1007/11945918_30 ◽

2006 ◽

pp. 277-288 ◽

Cited By ~ 9

Author(s):

Yu Hua ◽

Bin Xiao

Keyword(s):

Data Structure ◽

Bloom Filters ◽

Network Services ◽

Attribute Data


List of Twin Clusters: A Data Structure for Similarity Joins in Metric Spaces

First International Workshop on Similarity Search and Applications (sisap 2008) ◽

10.1109/sisap.2008.21 ◽

2008 ◽

Cited By ~ 1

Author(s):

Rodrigo Paredes ◽

Nora Reyes

Keyword(s):

Data Structure ◽

Metric Spaces ◽

Similarity Joins


Study on cadastral basic attribute data structure based on man-land relationship

10.1117/12.838684 ◽

2009 ◽

Author(s):

Changgen Zhan ◽

Yaolin Liu

Keyword(s):

Data Structure ◽

Attribute Data


List of twin clusters: a data structure for similarity joins in metric spaces

2008 IEEE 24th International Conference on Data Engineering Workshop ◽

10.1109/icdew.2008.4498353 ◽

2008 ◽

Author(s):

Rodrigo Paredes ◽

Nora Reyes

Keyword(s):

Data Structure ◽

Metric Spaces ◽

Similarity Joins


Optimization and Extension of Stream-Relation Joins

International Journal of Information Technology & Decision Making ◽

10.1142/s0219622019500214 ◽

2019 ◽

Vol 18 (04) ◽

pp. 1289-1315

Author(s):

M. Asif Naeem

Keyword(s):

Cost Model ◽

Stream Processing ◽

Research Area ◽

Skewed Distribution ◽

Selection Strategy ◽

Service Rate ◽

Stream Data ◽

Multi Stage ◽

Join Algorithms ◽

Rate Evaluation

Online stream processing is an emerging research area in computer science. Semi-stream processing is a particular type of stream processing in which a stream of data is processed against a disk-based relation. A semi-stream join operator is required to implement this operation. Many semi-stream joins use a queue of stream tuples to amortize the access cost for the disk-based relation, and use an index to allow directed access to the relation, avoiding the loading of unnecessary partitions of [Formula: see text]. In such a situation, the question arises of which [Formula: see text] partitions should be accessed, since any stream tuple from the queue could serve as a lookup element for accessing the relation index. Existing algorithms use simple, safe, and correct strategies, but these are not optimal in the sense of maximizing the join service rate. This paper makes two contributions. The first is an optimization: we analyze strategies for selecting an appropriate lookup element, particularly for skewed stream data, and show that a good selection strategy can significantly improve the service rate of existing join algorithms. The second is an extension: we develop a multi-stage join for semi-stream join algorithms. A multi-stage join is important when stream data needs to be joined with two or more tables in the relation, e.g., when a stream of sales data needs to be enriched with information from product and customer tables. To the best of our knowledge, none of the existing algorithms implements this feature. For the service rate evaluation we use two well-performing existing algorithms, CACHEJOIN and HYBRIDJOIN. We evaluate the service rate using real, TPC-H, and synthetic datasets with a known skewed distribution. We also present the cost model for our multi-stage join.
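The queue-and-index mechanism described in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not CACHEJOIN, HYBRIDJOIN, or the paper's selection strategies: the disk-based relation is simulated by an in-memory dictionary of partitions, the index by a hypothetical `partition_of` function, and the lookup element is simply the oldest queued tuple.

```python
from collections import deque


def semi_stream_join(stream, partitions, partition_of):
    """Join stream tuples (key, payload) against a partitioned relation.

    partitions: dict partition_id -> dict key -> relation row (stands in
        for the disk-based relation).
    partition_of: function mapping a key to its partition_id (stands in
        for the index on the relation).
    Returns (output tuples, number of simulated partition loads).
    """
    queue = deque(stream)
    output, loads = [], 0
    while queue:
        # Lookup element: here simply the oldest queued tuple (an assumption).
        lookup_key, _ = queue[0]
        pid = partition_of(lookup_key)
        partition = partitions.get(pid, {})  # one simulated disk access
        loads += 1
        remaining = deque()
        for key, payload in queue:
            if partition_of(key) == pid:
                # Every queued tuple covered by the loaded partition is
                # served now, which amortizes the cost of the load.
                if key in partition:
                    output.append((key, payload, partition[key]))
            else:
                remaining.append((key, payload))
        queue = remaining
    return output, loads
```

Replacing the choice of `queue[0]` with a different rule changes how many queued tuples each partition load serves, which is the kind of effect a lookup-element selection strategy targets.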
