Duplicate Detection
Recently Published Documents


TOTAL DOCUMENTS: 260 (FIVE YEARS: 29)
H-INDEX: 20 (FIVE YEARS: 2)

2021 · Vol 8 (1)
Author(s): Abdulrazzak Ali, Nurul A. Emran, Siti A. Asmai

Duplicate records are a common problem in data sets, especially in high-volume databases. The accuracy of duplicate detection determines the efficiency of the subsequent duplicate-removal process. However, duplicate detection becomes more challenging when records contain missing values: during clustering and matching, missing values can cause similar records to be assigned to the wrong group, leading to undetected duplicates. This paper proposes a method for improving duplicate detection despite the presence of missing values, called Duplicate Detection within the Incomplete Data set (DDID). Missing values were hypothetically added to the key attributes of three data sets under study, using an arbitrary pattern to simulate both complete and incomplete data sets. Duplicate detection performance was then evaluated using the Hot Deck method to compensate for the missing values in the key attributes, under the hypothesis that Hot Deck imputation would improve detection performance. Furthermore, DDID was compared to an earlier duplicate detection method, DuDe, in terms of accuracy and speed. The findings show that, even on incomplete data sets, DDID offers better accuracy and faster duplicate detection than DuDe. The results of this study offer insights into the constraints of duplicate detection within incomplete data sets.
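As a rough illustration of the idea described above, the sketch below applies a simple sequential hot-deck imputation to the key attributes of a small record set before matching. The function names and the naive exact-match detector are hypothetical stand-ins, not the authors' DDID implementation.

# Illustrative sketch only -- not the authors' DDID implementation.
# Hot-deck imputation fills missing key attributes from observed donors,
# then a naive matcher groups records that agree on all key attributes.
import pandas as pd

def impute_hot_deck(df, key_attrs):
    """Sequential hot deck: fill gaps with the nearest observed value."""
    df = df.copy()
    for col in key_attrs:
        df[col] = df[col].ffill().bfill()
    return df

def find_duplicates(df, key_attrs):
    """Naive matcher: report pairs of records agreeing on all key attributes."""
    pairs = []
    for _, idx in df.groupby(key_attrs).indices.items():
        idx = list(idx)
        pairs += [(idx[i], idx[j])
                  for i in range(len(idx)) for j in range(i + 1, len(idx))]
    return pairs

records = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", None, "Bo Chen"],
    "city": ["Oslo", None, "Oslo", "Bergen"],
})
complete = impute_hot_deck(records, ["name", "city"])
print(find_duplicates(complete, ["name", "city"]))  # [(0, 1), (0, 2), (1, 2)]

Without the imputation step, the records with missing name or city would fall into separate groups and the duplicates would go undetected, which is the failure mode the abstract describes.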


2021 · Vol 174 · pp. 114759
Author(s): Nick Valstar, Flavius Frasincar, Gianni Brauwers

2021
Author(s): Alexandre Bloch, Daniel Alexandre Bloch

2020
Author(s): Hortênsia Costa Barcelos, Mariana Recamonde Mendoza, Viviane Pereira Moreira

This work addresses the problem of identifying and fusing duplicate features in machine learning datasets. Our goal is to evaluate the hypothesis that fusing duplicate features can improve the predictive power of the data while reducing training time. We propose a simple method for duplicate detection and fusion based on a small set of features. An evaluation comparing the duplicate detection against a manually generated ground truth achieved an F1 of 0.91. The effects of fusion were then measured on a mortality prediction task. The results were inferior to those obtained with the original dataset, so we conclude that the investigated hypothesis does not hold.
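The general idea of detecting and fusing duplicate features can be sketched as below. This is a hedged illustration under simple assumptions (two columns count as duplicates when their overlapping non-missing values mostly agree), not the method evaluated in the paper; all names are hypothetical.

# Hedged sketch of duplicate-feature detection and fusion on a toy
# table -- an illustration of the general idea, not the authors' method.
import pandas as pd

def duplicate_feature_pairs(df, threshold=0.95):
    """Flag column pairs whose overlapping non-missing values mostly agree."""
    pairs, cols = [], list(df.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            both = df[[a, b]].dropna()
            if len(both) and (both[a] == both[b]).mean() >= threshold:
                pairs.append((a, b))
    return pairs

def fuse(df, a, b):
    """Merge column b into a: keep a's values, fill its gaps from b."""
    df = df.copy()
    df[a] = df[a].combine_first(df[b])
    return df.drop(columns=[b])

df = pd.DataFrame({
    "heart_rate":  [72, None, 88],
    "hr":          [72, 80, 88],      # same signal under a different name
    "systolic_bp": [120, 115, 130],
})
for a, b in duplicate_feature_pairs(df):
    if a in df.columns and b in df.columns:  # b may already be fused away
        df = fuse(df, a, b)
print(df)  # heart_rate fused with hr; systolic_bp untouched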


Author(s): Anil Ahlawat, Kalpna Sagar

Introduction: The need for efficient search engines has grown with ever-increasing technological advancement and the huge demand for data on the web. Method: Automated duplicate detection over query results identifies records from multiple web databases that refer to the same real-world entity and returns the non-matching records to end-users. The algorithm proposed in this paper is based on an unsupervised approach with classifiers over heterogeneous web databases and returns more accurate results, with high precision, recall, and F-measure. Several experiments were conducted to analyze the efficacy of the proposed algorithm in identifying duplicates. Result: The results show that the proposed algorithm achieves higher precision and F-measure than standard UDD, with the same recall. Conclusion: The proposed algorithm outperforms standard UDD. Discussion: This paper introduces an algorithm that automates duplicate detection for lexically heterogeneous web databases.
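For context, UDD-style unsupervised matching rests on combining per-field similarities into a record-pair score. The minimal sketch below shows only a weighted component similarity summing step, with hypothetical field weights and threshold; the real UDD iterates between two cooperating classifiers, which is omitted here.

# Simplified, hypothetical sketch of weighted component similarity
# summing for unsupervised record matching; not the paper's algorithm.
from difflib import SequenceMatcher

def field_sim(a, b):
    """Character-level similarity of two field values, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def record_sim(r1, r2, weights):
    """Weighted sum of per-field similarities across a record pair."""
    return sum(w * field_sim(r1[f], r2[f]) for f, w in weights.items())

rec_a = {"title": "Data Cleaning: Problems and Current Approaches",
         "author": "E. Rahm"}
rec_b = {"title": "Data cleaning - problems and current approaches",
         "author": "Erhard Rahm"}

weights = {"title": 0.7, "author": 0.3}  # hypothetical field weights
score = record_sim(rec_a, rec_b, weights)
print(f"similarity = {score:.2f} -> duplicate: {score >= 0.75}")  # threshold assumed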


2020 · Vol 17 (8) · pp. 3548-3552
Author(s): M. S. Roobini, P. B. S. Sumanth Kumar, P. B. Raviteja Reddy, Anitha Ponraj, J. Aruna

Although there is a long line of work on detecting duplicates in relational data, only a few approaches focus on duplicate detection in more complex hierarchical structures such as XML data. This paper presents a novel method for XML duplicate detection, called XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML elements being duplicates, considering not only the information the elements contain but also how that information is structured. In addition, to improve the efficiency of network evaluation, a novel pruning strategy is introduced, capable of significant gains over the unoptimized algorithm. Through experiments, we show that our approach achieves high precision and outperforms another state-of-the-art duplicate detection solution, in terms of both efficiency and effectiveness.
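The intuition that duplicate probability should depend on both element values and structure can be sketched as a recursive score over paired XML subtrees, as below. This is only a loose stand-in: the combination rules (string-ratio leaves, best same-tag child matches averaged) are assumptions, not the Bayesian network construction used by XMLDup.

# Loose sketch of the XMLDup intuition: the duplicate probability of two
# XML elements depends on both their values and how their children match.
# This is NOT the paper's Bayesian network; the combination rules below
# are illustrative assumptions only.
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

def dup_prob(e1, e2):
    if len(e1) == 0 and len(e2) == 0:  # leaves: compare text values
        t1, t2 = (e1.text or "").strip(), (e2.text or "").strip()
        return SequenceMatcher(None, t1, t2).ratio()
    # Internal nodes: pair each child with its best same-tag match, then average.
    scores = [max((dup_prob(c1, c2) for c2 in e2 if c2.tag == c1.tag), default=0.0)
              for c1 in e1]
    return sum(scores) / len(scores) if scores else 0.0

a = ET.fromstring("<person><name>J. Smith</name><city>Lisbon</city></person>")
b = ET.fromstring("<person><name>John Smith</name><city>Lisboa</city></person>")
print(f"P(duplicate) = {dup_prob(a, b):.2f}")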


2020 · Vol 12 (3) · pp. 1-24
Author(s): Ioannis Koumarelas, Lan Jiang, Felix Naumann
