Incorporating Clustering into Set Similarity Join Algorithms: The SjClust Framework

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.

SjClust: A Framework for Incorporating Clustering into Set Similarity Join Algorithms

Lecture Notes in Computer Science - Transactions on Large-Scale Data- and Knowledge-Centered Systems XXXVIII ◽

10.1007/978-3-662-58384-5_4 ◽

2018 ◽

pp. 89-118

Author(s):

Leonardo Andrade Ribeiro ◽

Alfredo Cuzzocrea ◽

Karen Aline Alves Bezerra ◽

Ben Hur Bahia do Nascimento

Keyword(s):

Similarity Join ◽

Join Algorithms

Improving similarity join algorithms using vertical clustering techniques

2009 Second International Conference on the Applications of Digital Information and Web Technologies ◽

10.1109/icadiwt.2009.5273906 ◽

2009 ◽

Cited By ~ 1

Author(s):

Lisa Tan ◽

Farshad Fotouhi ◽

William Grosky

Keyword(s):

Similarity Join ◽

Clustering Techniques ◽

Join Algorithms

Parallel Top-K Similarity Join Algorithms Using MapReduce

2012 IEEE 28th International Conference on Data Engineering ◽

10.1109/icde.2012.87 ◽

2012 ◽

Cited By ~ 42

Author(s):

Younghoon Kim ◽

Kyuseok Shim

Keyword(s):

Similarity Join ◽

Join Algorithms

Parallelizing String Similarity Join Algorithms

Lecture Notes in Computer Science - Databases Theory and Applications ◽

10.1007/978-3-319-92013-9_27 ◽

2018 ◽

pp. 322-327

Author(s):

Ling-Chih Yao ◽

Lipyeow Lim

Keyword(s):

Similarity Join ◽

String Similarity ◽

Join Algorithms

SJClust: Towards a Framework for Integrating Similarity Join Algorithms and Clustering

Proceedings of the 18th International Conference on Enterprise Information Systems ◽

10.5220/0005868700750080 ◽

2016 ◽

Cited By ~ 3

Author(s):

Leonardo Andrade Ribeiro ◽

Alfredo Cuzzocrea ◽

Karen Aline Alves Bezerra ◽

Ben Hur Bahia do Nascimento

Keyword(s):

Similarity Join ◽

Join Algorithms

SURVEY OF SIMILARITY JOIN ALGORITHMS BASED ON MAPREDUCE

MATTER International Journal of Science and Technology ◽

10.20319/mijst.2016.s21.214234 ◽

2017 ◽

Vol 2 (1) ◽

pp. 214-234

Author(s):

Amer Al-Badarneh ◽

Amnah Al-Abdi ◽

Sana’a Al-Shboul ◽

Hassan Najadat ◽

...

Keyword(s):

Similarity Join ◽

Join Algorithms

Improving Similarity Join Algorithms Using Fuzzy Clustering Technique

2009 IEEE International Conference on Data Mining Workshops ◽

10.1109/icdmw.2009.50 ◽

2009 ◽

Author(s):

Lisa Tan ◽

Farshad Fotouhi ◽

William Grosky ◽

Horia F. Pop ◽

Noureddine Mouaddib

Keyword(s):

Fuzzy Clustering ◽

Similarity Join ◽

Clustering Technique ◽

Join Algorithms

In-Memory Interval Joins

The VLDB Journal ◽

10.1007/s00778-020-00639-0 ◽

2021 ◽

Author(s):

Panagiotis Bouros ◽

Nikos Mamoulis ◽

Dimitrios Tsitsigkos ◽

Manolis Terrovitis

Keyword(s):

Parallel Computation ◽

State Of The Art ◽

Complex Data ◽

Plane Sweep ◽

Join Algorithm ◽

Sweep Algorithm ◽

Join Algorithms ◽

Domain Partitioning ◽

Complex Data Structure ◽

Independent Tasks

Abstract: The interval join is a popular operation in temporal, spatial, and uncertain databases. The majority of interval join algorithms assume that input data reside on disk, so their focus is on minimizing I/O accesses. Recently, an in-memory approach based on plane sweep (PS) for modern hardware was proposed which greatly outperforms previous work. However, this approach relies on a complex data structure and its parallelization has not been adequately studied. In this article, we investigate in-memory interval joins in two directions. First, we explore the applicability of a largely ignored forward scan (FS)-based plane sweep algorithm for single-threaded join evaluation. We propose four optimizations for FS that greatly reduce its cost, making it competitive with, or even faster than, the state of the art. Second, we study in depth the parallel computation of interval joins. We design a non-partitioning-based approach that determines independent tasks of the join algorithm to run in parallel. Then, we address the drawbacks of the previously proposed hash-based partitioning and suggest a domain-based partitioning approach that does not produce duplicate results. Within our approach, we propose a novel breakdown of the partition-joins into mini-joins to be scheduled on the available CPU threads and propose an adaptive domain partitioning, aiming at load balancing. We also investigate how the partitioning phase can benefit from modern parallel hardware. Our thorough experimental analysis demonstrates the advantage of our novel partitioning-based approach for parallel computation.
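
For readers unfamiliar with the forward scan (FS) idea mentioned in the abstract, the following is a minimal sketch of a basic, unoptimized forward-scan plane sweep join. It is not the authors' implementation and includes none of the four optimizations or the parallel schemes the abstract refers to. The function name `forward_scan_join`, the representation of intervals as closed `(start, end)` tuples, and the sample data are all assumptions made for illustration.

```python
def forward_scan_join(R, S):
    """Return all pairs (r, s) of overlapping closed intervals.

    Illustrative sketch only: intervals are (start, end) tuples with
    start <= end, and two intervals overlap when they share a point.
    """
    R = sorted(R)  # sort both inputs by start point
    S = sorted(S)
    out = []
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i][0] <= S[j][0]:
            # R[i] starts first: scan forward in S while S[k] starts
            # no later than R[i] ends; each such S[k] overlaps R[i].
            r = R[i]
            k = j
            while k < len(S) and S[k][0] <= r[1]:
                out.append((r, S[k]))
                k += 1
            i += 1
        else:
            # S[j] starts first: symmetric forward scan over R.
            s = S[j]
            k = i
            while k < len(R) and R[k][0] <= s[1]:
                out.append((R[k], s))
                k += 1
            j += 1
    return out


# Hypothetical sample data
R = [(1, 5), (6, 8)]
S = [(4, 7), (9, 10)]
pairs = forward_scan_join(R, S)
```

Each overlapping pair is reported exactly once, when the interval with the smaller start point is processed, so no separate duplicate elimination is needed in this single-threaded form.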

PPIS-JOIN: A Novel Privacy-Preserving Image Similarity Join Method

Neural Processing Letters ◽

10.1007/s11063-021-10537-3 ◽

2021 ◽

Author(s):

Chengyuan Zhang ◽

Fangxin Xie ◽

Hao Yu ◽

Jianfeng Zhang ◽

Lei Zhu ◽

...

Keyword(s):

Privacy Preserving ◽

Image Similarity ◽

Similarity Join