Duplicate Detection
Recently Published Documents


TOTAL DOCUMENTS: 260 (FIVE YEARS: 29)
H-INDEX: 20 (FIVE YEARS: 2)

2021 · Vol 8 (1)
Author(s): Abdulrazzak Ali, Nurul A. Emran, Siti A. Asmai

Duplicate records are a common problem in data sets, especially in high-volume databases. The accuracy of duplicate detection determines the efficiency of the subsequent duplicate-removal process. However, duplicate detection becomes more challenging when records contain missing values: during clustering and matching, missing values can cause similar records to be assigned to the wrong group, leading to undetected duplicates. This paper proposes a method for improving duplicate detection despite the presence of missing values, called Duplicate Detection within the Incomplete Data set (DDID). Missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. Duplicate detection performance was then evaluated using the Hot Deck method to compensate for the missing values in the key attributes, under the hypothesis that Hot Deck imputation would improve detection performance. Furthermore, DDID was compared to an earlier duplicate detection method, DuDe, in terms of accuracy and speed. The findings show that, even on incomplete data sets, DDID offers better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.
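As a rough illustration of the idea described above, the sketch below applies a simple sequential hot-deck imputation to the key attributes of a small record set before matching. The function names and the naive exact-match detector are hypothetical stand-ins, not the authors' DDID implementation.

# Illustrative sketch only -- not the authors' DDID implementation.
# Hot-deck imputation fills missing key attributes from observed donors,
# then a naive matcher groups records that agree on all key attributes.
import pandas as pd

def impute_hot_deck(df, key_attrs):
    """Sequential hot deck: fill gaps with the nearest observed value."""
    df = df.copy()
    for col in key_attrs:
        df[col] = df[col].ffill().bfill()
    return df

def find_duplicates(df, key_attrs):
    """Naive matcher: report pairs of records agreeing on all key attributes."""
    pairs = []
    for _, idx in df.groupby(key_attrs).indices.items():
        idx = list(idx)
        pairs += [(idx[i], idx[j])
                  for i in range(len(idx)) for j in range(i + 1, len(idx))]
    return pairs

records = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", None, "Bo Chen"],
    "city": ["Oslo", None, "Oslo", "Bergen"],
})
complete = impute_hot_deck(records, ["name", "city"])
print(find_duplicates(complete, ["name", "city"]))  # [(0, 1), (0, 2), (1, 2)]

Without the imputation step, the records with missing name or city would fall into separate groups and the duplicates would go undetected, which is the failure mode the abstract describes.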


2021 · Vol 174 · pp. 114759
Author(s): Nick Valstar, Flavius Frasincar, Gianni Brauwers

2021
Author(s): Alexandre Bloch, Daniel Alexandre Bloch

2020
Author(s): Hortênsia Costa Barcelos, Mariana Recamonde Mendoza, Viviane Pereira Moreira

This work addresses the problem of identifying and fusing duplicate features in machine learning datasets. Our goal is to evaluate the hypothesis that fusing duplicate features can improve the predictive power of the data while reducing training time. We propose a simple method for duplicate detection and fusion based on a small set of features. An evaluation comparing the duplicate detection against a manually generated ground truth achieved an F1 of 0.91. The effects of fusion were then measured on a mortality prediction task. The results were inferior to those obtained with the original dataset, so we conclude that the investigated hypothesis does not hold.
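The general idea of detecting and fusing duplicate features can be sketched as below. This is a hedged illustration under simple assumptions (two columns count as duplicates when their overlapping non-missing values mostly agree), not the method evaluated in the paper; all names are hypothetical.

# Hedged sketch of duplicate-feature detection and fusion on a toy
# table -- an illustration of the general idea, not the authors' method.
import pandas as pd

def duplicate_feature_pairs(df, threshold=0.95):
    """Flag column pairs whose overlapping non-missing values mostly agree."""
    pairs, cols = [], list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            both = df[[a, b]].dropna()
            if len(both) and (both[a] == both[b]).mean() >= threshold:
                pairs.append((a, b))
    return pairs

def fuse(df, a, b):
    """Merge column b into a: keep a's values, fill its gaps from b."""
    df = df.copy()
    df[a] = df[a].combine_first(df[b])
    return df.drop(columns=[b])

df = pd.DataFrame({
    "heart_rate":  [72, None, 88],
    "hr":          [72, 80, 88],      # same signal under a different name
    "systolic_bp": [120, 115, 130],
})
for a, b in duplicate_feature_pairs(df):
    if a in df.columns and b in df.columns:  # b may already be fused away
        df = fuse(df, a, b)
print(df)  # heart_rate fused with hr; systolic_bp untouched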


Author(s): Anil Ahlawat, Kalpna Sagar

Introduction: The need for efficient search engines has grown with ever-increasing technological advancement and the huge demand for data on the web. Method: Automated duplicate detection over query results identifies records from multiple web databases that refer to the same real-world entity and returns the non-matching records to end-users. The algorithm proposed in this paper is based on an unsupervised approach with classifiers over heterogeneous web databases and returns more accurate results, with high precision, recall, and F-measure. Several experiments were conducted to analyze the efficacy of the proposed algorithm in identifying duplicates. Result: The results show that the proposed algorithm achieves higher precision and F-measure than standard UDD, with the same recall. Conclusion: The proposed algorithm outperforms standard UDD. Discussion: This paper introduces an algorithm that automates duplicate detection for lexically heterogeneous web databases.
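For context, UDD-style unsupervised matching rests on combining per-field similarities into a record-pair score. The minimal sketch below shows only a weighted component similarity summing step, with hypothetical field weights and threshold; the real UDD iterates between two cooperating classifiers, which is omitted here.

# Simplified, hypothetical sketch of weighted component similarity
# summing for unsupervised record matching; not the paper's algorithm.
from difflib import SequenceMatcher

def field_sim(a, b):
    """Character-level similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_sim(r1, r2, weights):
    """Weighted sum of per-field similarities across a record pair."""
    return sum(w * field_sim(r1[f], r2[f]) for f, w in weights.items())

rec_a = {"title": "Data Cleaning: Problems and Current Approaches",
         "author": "E. Rahm"}
rec_b = {"title": "Data cleaning - problems and current approaches",
         "author": "Erhard Rahm"}

weights = {"title": 0.7, "author": 0.3}  # hypothetical field weights
score = record_sim(rec_a, rec_b, weights)
print(f"similarity = {score:.2f} -> duplicate: {score >= 0.75}")  # threshold assumed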


2020 · Vol 17 (8) · pp. 3548-3552
Author(s): M. S. Roobini, P. B. S. Sumanth Kumar, P. B. Raviteja Reddy, Anitha Ponraj, J. Aruna

Although there is a long line of work on detecting duplicates in relational data, only a few approaches focus on duplicate detection in more complex hierarchical structures such as XML data. This paper presents a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information the elements contain but also how that information is structured. In addition, to improve the efficiency of network evaluation, a novel pruning strategy is introduced, capable of significant gains over the unoptimized algorithm. Through experiments, we show that our approach achieves high precision and outperforms another state-of-the-art duplicate detection solution, in terms of both efficiency and effectiveness.
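The intuition that duplicate probability should depend on both element values and structure can be sketched as a recursive score over paired XML subtrees, as below. This is only a loose stand-in: the combination rules (string-ratio leaves, best same-tag child matches averaged) are assumptions, not the Bayesian network construction used by XMLDup.

# Loose sketch of the XMLDup intuition: the duplicate probability of two
# XML elements depends on both their values and how their children match.
# This is NOT the paper's Bayesian network; the combination rules below
# are illustrative assumptions only.
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

def dup_prob(e1, e2):
    if len(e1) == 0 and len(e2) == 0:  # leaves: compare text values
        t1, t2 = (e1.text or "").strip(), (e2.text or "").strip()
        return SequenceMatcher(None, t1, t2).ratio()
    # Internal nodes: pair each child with its best same-tag match, then average.
    scores = [max((dup_prob(c1, c2) for c2 in e2 if c2.tag == c1.tag), default=0.0)
              for c1 in e1]
    return sum(scores) / len(scores) if scores else 0.0

a = ET.fromstring("<person><name>J. Smith</name><city>Lisbon</city></person>")
b = ET.fromstring("<person><name>John Smith</name><city>Lisboa</city></person>")
print(f"P(duplicate) = {dup_prob(a, b):.2f}")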


2020 · Vol 12 (3) · pp. 1-24
Author(s): Ioannis Koumarelas, Lan Jiang, Felix Naumann
