similarity joins Latest Research Papers

Efficient Set Similarity Join on Multi-Attribute Data Using Lightweight Filters

Journal of Information and Data Management ◽

10.5753/jidm.2021.1969 ◽

2021 ◽

Vol 12 (3) ◽

Author(s):

Leonardo Andrade Ribeiro ◽

Felipe Ferreira Borges ◽

Diego Oliveira

Keyword(s):

Data Structure ◽

Processing Time ◽

Cost Model ◽

Similarity Join ◽

Attribute Data ◽

Join Algorithms ◽

Filtering Technique ◽

Alternative Approaches ◽

Similarity Joins ◽

Single Set

We consider the problem of efficiently answering set similarity joins on multi-attribute data. Traditional set similarity join algorithms assume string data represented by a single set and, thus, miss the opportunity to exploit predicates over multiple attributes to reduce the number of similarity computations. In this article, we present a framework to enhance existing algorithms with additional filters for dealing with multi-attribute data. We then instantiate this framework with a lightweight filtering technique based on a simple, yet effective data structure, for which exact and probabilistic implementations are evaluated. In this context, we devise a cost model to identify the best attribute ordering to reduce processing time. Moreover, alternative approaches are also investigated and a new algorithm combining key ideas from previous work is introduced. Finally, we present a thorough experimental evaluation, which demonstrates that our main proposal is efficient and significantly outperforms competing algorithms.
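The idea of using predicates over several attributes to avoid similarity computations can be illustrated with a minimal sketch. It is not the framework or the data structure from the article: the record layout (one token set per attribute), the per-attribute Jaccard thresholds, and the names `size_filter` and `multi_attribute_join` are assumptions made for illustration. A cheap size check is applied to every attribute before any Jaccard similarity is computed.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; taken as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_filter(a, b, t):
    """Necessary condition for jaccard(a, b) >= t: |smaller| >= t * |larger|."""
    small, large = sorted((len(a), len(b)))
    return large == 0 or small >= t * large


def multi_attribute_join(records, thresholds):
    """Naive self-join over records given as tuples of sets (hypothetical layout).

    Returns the matching index pairs and the number of pairs that reached
    the expensive verification step.
    """
    pairs = []
    verifications = 0
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        # Lightweight filters on every attribute come first.
        if not all(size_filter(a, b, t) for a, b, t in zip(r, s, thresholds)):
            continue
        verifications += 1
        # Verification: compute the actual similarity on each attribute.
        if all(jaccard(a, b) >= t for a, b, t in zip(r, s, thresholds)):
            pairs.append((i, j))
    return pairs, verifications


# Illustrative records: (title tokens, author tokens).
records = [
    ({"efficient", "set", "similarity", "join"}, {"ribeiro", "borges"}),
    ({"efficient", "set", "similarity", "joins"}, {"ribeiro", "borges", "oliveira"}),
    ({"set", "similarity", "join"}, {"fier"}),
    ({"streaming"}, {"ribeiro", "pacifico"}),
]
pairs, verifications = multi_attribute_join(records, thresholds=(0.5, 0.5))
print(pairs, verifications)
```

In this sketch the second attribute prunes pairs that a single-set join on titles alone would still have to verify; a real implementation would also use an index rather than enumerating all pairs.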

Parallelizing filter-and-verification based exact set similarity joins on multicores

Information Systems ◽

10.1016/j.is.2021.101912 ◽

2021 ◽

pp. 101912

Author(s):

Fabian Fier ◽

Johann-Christoph Freytag

Keyword(s):

Similarity Joins

Scalable signal reconstruction for a broad range of applications

Communications of the ACM ◽

10.1145/3441689 ◽

2021 ◽

Vol 64 (2) ◽

pp. 106-115

Author(s):

Abolfazl Asudeh ◽

Jees Augustine ◽

Saravanan Thirumuruganathan ◽

Azade Nazi ◽

Nan Zhang ◽

...

Keyword(s):

Traffic Engineering ◽

Optimization Problem ◽

Signal Reconstruction ◽

Linear Equations ◽

Synthetic Data ◽

Scalable Algorithm ◽

Underdetermined System ◽

Medical Image Reconstruction ◽

The Common ◽

Similarity Joins

The signal reconstruction problem (SRP) is an important optimization problem in which the objective is to identify a solution to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas, such as network traffic engineering, medical image reconstruction, acoustics, and astronomy. Unfortunately, most of the common approaches for solving SRP do not scale to large problem sizes. We propose a novel and scalable algorithm for solving this critical problem. Specifically, we make four major contributions. First, we propose a dual formulation of the problem and develop the DIRECT algorithm that is significantly more efficient than the state of the art. Second, we show how adapting database techniques developed for scalable similarity joins provides a substantial speedup over DIRECT. Third, we describe several practical techniques that allow our algorithm to scale, on a single machine, to settings that are orders of magnitude larger than previously studied. Finally, we use the database techniques of materialization and reuse to extend our result to dynamic settings where the input to the SRP changes. Extensive experiments on real-world and synthetic data confirm the efficiency, effectiveness, and scalability of our proposal.
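As a worked illustration of what "closest to a given prior" can mean, the following derivation assumes the Euclidean ($\ell_2$) distance and a matrix $A \in \mathbb{R}^{m \times n}$ with $m < n$ and full row rank. The symbols $A$, $b$, $x'$, and $\lambda$ are notation chosen here, and the closed form below is the standard Lagrangian solution of this assumed problem, not necessarily the exact formulation behind the DIRECT algorithm.

```latex
\begin{aligned}
&\min_{x \in \mathbb{R}^n} \ \tfrac{1}{2}\,\lVert x - x' \rVert_2^2
  \quad \text{subject to} \quad A x = b, \\[4pt]
&\mathcal{L}(x, \lambda) = \tfrac{1}{2}\,\lVert x - x' \rVert_2^2 + \lambda^{\top}(A x - b), \\[4pt]
&\nabla_x \mathcal{L} = 0 \ \Longrightarrow\ x = x' - A^{\top}\lambda, \\[4pt]
&A x = b \ \Longrightarrow\ A x' - A A^{\top}\lambda = b
  \ \Longrightarrow\ \lambda = (A A^{\top})^{-1}(A x' - b), \\[4pt]
&x^{*} = x' - A^{\top}(A A^{\top})^{-1}(A x' - b).
\end{aligned}
```

Under these assumptions the dual variable $\lambda$ has only $m$ entries, so the dominant cost is forming and solving the $m \times m$ system involving $A A^{\top}$.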

Streaming Set Similarity Joins

Enterprise Information Systems - Lecture Notes in Business Information Processing ◽

10.1007/978-3-030-75418-1_2 ◽

2021 ◽

pp. 24-42

Author(s):

Lucas Pacífico ◽

Leonardo Andrade Ribeiro

Keyword(s):

Similarity Joins

Handling data-skewness in character based string similarity join using Hadoop

Applied Computing and Informatics ◽

10.1016/j.aci.2018.11.001 ◽

2020 ◽

Vol ahead-of-print ◽

Cited By ~ 1

Author(s):

Kanak Meena ◽

Devendra K. Tayal ◽

Oscar Castillo ◽

Amita Jain

Keyword(s):

Scientific Data ◽

Distribution Law ◽

Similarity Join ◽

String Similarity ◽

Zipf Distribution ◽

Imbalance Problem ◽

Data Skewness ◽

Pair Generation ◽

Set Up ◽

Similarity Joins

The scalability of similarity joins is threatened by data skewness, an often unexpected characteristic of the data. This is a pervasive problem in scientific data. Skewness produces an uneven distribution of attributes, which can cause a severe load imbalance. When database join operations are applied to these datasets, skewness grows exponentially. All the algorithms developed to date for the implementation of database joins are highly skew-sensitive. This paper presents a new approach for handling data skewness in a character-based string similarity join using the MapReduce framework. No existing work handles data skewness in character-based string similarity joins, although work on set-based string similarity joins exists. The proposed work is divided into three stages, each further divided into mapper and reducer phases dedicated to a specific task. The first stage finds the lengths of the strings in a dataset. The second stage uses the MR-Pass Join framework to generate valid candidate pairs. The third stage, which is further divided into four MapReduce phases, incorporates MRFA concepts into the string similarity join; the result is named “MRFA-SSJ” (MapReduce Frequency Adaptive – String Similarity Join). MRFA-SSJ is thus proposed to handle skewness in the string similarity join. The experiments were run on Hadoop over three datasets: DBLP, Query log, and a real dataset of IP addresses and cookies. The proposed algorithm was compared with three known algorithms, all of which fail when the data is highly skewed, whereas the proposed method handles highly skewed data without any problem. A 15-node cluster was used in this experiment, and the Zipf distribution law is followed for the analysis of the skewness factor. A comparison between the existing and proposed techniques is also shown. Existing techniques survive only up to Zipf factor 0.5, whereas the proposed algorithm survives up to Zipf factor 1. Hence the proposed algorithm is skew-insensitive and ensures scalability with a reasonable query processing time for string similarity database joins. It also ensures an even distribution of attributes.
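To see why a higher Zipf factor stresses a cluster, the following minimal sketch estimates the load per node when keys with Zipf-distributed frequencies are spread over 15 nodes. It does not implement MRFA-SSJ or MR-Pass Join: the modulo assignment is a stand-in for hash partitioning, and the function names, the number of keys, and the record count are all assumptions chosen for illustration.

```python
def zipf_frequencies(num_keys, s, total):
    """Expected record count per key rank under a Zipf law with exponent s."""
    weights = [1.0 / (rank ** s) for rank in range(1, num_keys + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def partition_load(frequencies, num_nodes):
    """Assign key k to node k % num_nodes (stand-in for hash partitioning)."""
    load = [0.0] * num_nodes
    for key, freq in enumerate(frequencies):
        load[key % num_nodes] += freq
    return load


def imbalance(load):
    """Heaviest node's load divided by the mean load; 1.0 means balanced."""
    return max(load) / (sum(load) / len(load))


# Hypothetical setting: 1500 distinct keys, one million records, 15 nodes.
for s in (0.0, 0.5, 1.0):
    load = partition_load(zipf_frequencies(1500, s, 1_000_000), 15)
    print(f"Zipf factor {s}: imbalance ratio {imbalance(load):.2f}")
```

The imbalance ratio rises with the Zipf factor because the node that receives the most frequent key carries a growing share of all records, which is the load imbalance a skew-aware join has to counteract.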

Top-k String Similarity Joins

32nd International Conference on Scientific and Statistical Database Management ◽

10.1145/3400903.3400922 ◽

2020 ◽

Author(s):

Shuyao Qi ◽

Panagiotis Bouros ◽

Nikos Mamoulis

Keyword(s):

String Similarity ◽

Similarity Joins

An efficient algorithm for approximated self-similarity joins in metric spaces

Information Systems ◽

10.1016/j.is.2020.101510 ◽

2020 ◽

Vol 91 ◽

pp. 101510 ◽

Cited By ~ 1

Author(s):

Sebastián Ferrada ◽

Benjamin Bustos ◽

Nora Reyes

Keyword(s):

Efficient Algorithm ◽

Metric Spaces ◽

Self Similarity ◽

Similarity Joins

An Industrial Dynamic Skyline Based Similarity Joins For Multidimensional Big Data Applications

IEEE Transactions on Industrial Informatics ◽

10.1109/tii.2019.2933534 ◽

2020 ◽

Vol 16 (4) ◽

pp. 2520-2532 ◽

Cited By ~ 6

Author(s):

Bo Yin ◽

Xuetao Wei ◽

Jin Wang ◽

Naixue Xiong ◽

Ke Gu

Keyword(s):

Big Data ◽

Big Data Applications ◽

Similarity Joins

Adaptive Top-k Overlap Set Similarity Joins

2020 IEEE 36th International Conference on Data Engineering (ICDE) ◽

10.1109/icde48307.2020.00098 ◽

2020 ◽

Author(s):

Zhong Yang ◽

Bolong Zheng ◽

Guohui Li ◽

Xi Zhao ◽

Xiaofang Zhou ◽

...

Keyword(s):

Similarity Joins

Bitmap filter: Speeding up exact set similarity joins with bitwise operations

Information Systems ◽

10.1016/j.is.2019.101449 ◽

2020 ◽

Vol 88 ◽

pp. 101449

Author(s):

Edans F.O. Sandes ◽

George L.M. Teodoro ◽

Alba C.M.A. Melo

Keyword(s):

Similarity Joins ◽

Bitwise Operations
