Entity Matching
Recently Published Documents

TOTAL DOCUMENTS: 136 (five years: 47)
H-INDEX: 14 (five years: 3)

Author(s): Naofumi Osawa, Hiroyoshi Ito, Yukihiro Fukushima, Takashi Harada, Atsuyuki Morishima

2021
Author(s): Jin Wang, Yuliang Li, Wataru Hirota

2021, Vol 6 (3), pp. 130-138
Author(s): Rivanda Putra Pratama, Rahmat Hidayat, Nisrina Fadhilah Fano, Adam Akbar, Nur Aini Rakhmawati

Data processing speed matters to companies because it accelerates their analyses. Entity matching is a computational process that companies can perform during data processing: it determines whether two different records refer to the same real-world entity. Entity matching becomes problematic when the datasets being compared are large. Deep learning is one solution to this problem. DeepMatcher is a Python package, built on a deep learning model architecture, that can solve entity matching problems. The purpose of this study was to match records between two datasets by applying DeepMatcher to drug data from farmaku.com and k24klik.com. The comparison model used is the Hybrid model. Based on the test results, the Hybrid model produces accurate results, so the entity matching used in this study runs well. The best result was obtained in the 10th training run, with an F1 of 30.30, a precision of 17.86, and a recall of 100.
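
To make the setup concrete, the sketch below shows how DeepMatcher's Hybrid model is typically trained and evaluated. The data directory, CSV file names, and epoch count are hypothetical stand-ins, since the study's farmaku.com/k24klik.com files are not published here.

```python
# A minimal sketch of training DeepMatcher's Hybrid model (pip install deepmatcher).
# The directory and CSV names are hypothetical; each CSV holds labeled record pairs
# with aligned columns like left_name / right_name and a binary 'label' column.
import deepmatcher as dm

# Tokenize, build vocabularies, and embed the labeled pair data.
train, validation, test = dm.data.process(
    path='drug_pairs',            # hypothetical data directory
    train='train.csv',
    validation='validation.csv',
    test='test.csv')

# The 'hybrid' attribute summarizer combines RNN and attention encodings,
# corresponding to the Hybrid model used in the study.
model = dm.MatchingModel(attr_summarizer='hybrid')

# Train, keeping the checkpoint that scores best on the validation split.
model.run_train(train, validation, best_save_path='hybrid_model.pth', epochs=10)

# Evaluate on the held-out pairs; reports F1, precision, and recall.
model.run_eval(test)
```

run_eval reports the same F1, precision, and recall metrics quoted above.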


2021, Vol 36 (4), pp. 822-838
Author(s): Chen-Chen Sun, De-Rong Shen

2021, Vol 14 (11), pp. 2459-2472
Author(s): Saravanan Thirumuruganathan, Han Li, Nan Tang, Mourad Ouzzani, Yash Govind, ...

Entity matching (EM) finds data instances that refer to the same real-world entity. Most EM solutions perform blocking, then matching. Many works have applied deep learning (DL) to matching, but far fewer have applied DL to blocking, and those that do are limited: they consider only a simple form of DL, and some require labeled training data. In this paper, we develop the DeepBlocker framework, which significantly advances the state of the art in applying DL to blocking for EM. We first define a large space of DL solutions for blocking, which contains solutions of varying complexity and subsumes most previous works. Next, we develop eight representative solutions in this space. These solutions require no labeled training data and exploit recent advances in DL (e.g., sequence modeling, Transformers, self-supervision). We empirically determine which solutions perform best on which kinds of datasets (structured, textual, or dirty). We show that the best solutions (among the above eight) outperform the best existing DL solution and the best existing non-DL solutions (including a state-of-the-art industrial non-DL solution) on dirty and textual data, and are comparable on structured data. Finally, we show that combining the best DL and non-DL solutions can perform even better, suggesting a new avenue for research.
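
The core recipe behind DL-based blocking, turning each tuple into a vector and keeping only nearest-neighbor pairs as candidates, can be sketched with off-the-shelf components. This is an illustrative approximation, not the authors' DeepBlocker code: the encoder choice and the toy tables are assumptions.

```python
# Illustrative sketch of DL-based blocking: embed each tuple as a single string,
# then retrieve top-k nearest neighbors as the candidate set for matching.
# Uses an off-the-shelf encoder; NOT the DeepBlocker implementation itself.
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.neighbors import NearestNeighbors

# Hypothetical tuples from the two tables to be matched.
table_a = ["iPhone 12 Pro 128GB graphite", "Galaxy S21 Ultra 256GB"]
table_b = ["Apple iPhone 12 Pro (128 GB, graphite)",
           "Samsung Galaxy S21 Ultra 5G 256 GB",
           "Pixel 5 128GB"]

encoder = SentenceTransformer('all-MiniLM-L6-v2')
emb_a = encoder.encode(table_a, normalize_embeddings=True)
emb_b = encoder.encode(table_b, normalize_embeddings=True)

# For each tuple in table A, keep its k nearest tuples in table B as candidates;
# only these pairs survive blocking and reach the (more expensive) matcher.
k = 2
index = NearestNeighbors(n_neighbors=k, metric='cosine').fit(emb_b)
_, neighbors = index.kneighbors(emb_a)

for i, cand in enumerate(neighbors):
    for j in cand:
        print(f"candidate pair: ({table_a[i]!r}, {table_b[j]!r})")
```

The matcher then scores only these candidate pairs instead of the full cross product, which is what makes blocking worthwhile on large datasets.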


2021, Vol 2021, pp. 1-14
Author(s): Zhou Zhou, Youliang Tian, Changgen Peng

The need for data sharing with privacy guarantees has brought increasing attention to federated learning. However, existing aggregation models are too specialized and rarely address the issue of user withdrawal, and protocols for multiparty entity matching are seldom covered; there is thus no systematic framework for performing federated learning tasks. In this paper, we propose a privacy-preserving federated learning framework (PFLF). We first construct a general secure aggregation model for federated learning scenarios by combining Shamir secret sharing with homomorphic cryptography, ensuring that the aggregated value can be decrypted correctly only when the number of participants is greater than the threshold t. Furthermore, we propose a multiparty entity matching protocol that employs secure multiparty computation to solve the entity alignment problem, and a logistic regression algorithm that achieves privacy-preserving model training and supports user withdrawal in vertical federated learning (VFL) scenarios. Finally, security analyses prove that PFLF preserves data privacy in the honest-but-curious model, and experimental evaluations show that PFLF attains accuracy consistent with the original model and is practically feasible.
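
To illustrate the threshold property the aggregation relies on: in Shamir's (t, n) scheme, any t shares reconstruct the secret while fewer reveal nothing, which is what lets the protocol tolerate user withdrawal. The sketch below is a minimal standalone implementation over a prime field; the prime and parameters are illustrative, and it omits the homomorphic-encryption layer that PFLF combines it with.

```python
# Minimal (t, n) Shamir secret sharing over a prime field: any t shares
# reconstruct the secret; fewer than t reveal nothing. Illustrative only,
# not the PFLF protocol itself.
import random

P = 2**127 - 1  # a Mersenne prime; illustrative field modulus

def split(secret, t, n):
    # Random degree-(t-1) polynomial with constant term = secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def eval_at(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, eval_at(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = split(secret=123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789   # any 3 of 5 shares suffice
assert reconstruct(shares[1:4]) == 123456789  # a participant may drop out
```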


2021, Vol 14 (10), pp. 1913-1921
Author(s): Ralph Peeters, Christian Bizer

An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID for identifying entities in their respective domains. For data integration, this means that shared identifiers are often available for a subset of the entity descriptions to be integrated, but not for the rest. The challenge in these settings is to learn a matcher for entity descriptions without identifiers, using the descriptions that contain identifiers as training data. The task can be approached by learning a binary classifier that distinguishes pairs of entity descriptions referring to the same real-world entity from pairs referring to different entities. It can also be modeled as a multi-class classification problem by learning classifiers that identify descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, provided enough training data is available for both objectives. To gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods and baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes than RNN-based models.
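
The dual-objective idea can be sketched as a single BERT encoder feeding two classification heads whose losses are summed. This is a reconstruction from the abstract, not the authors' released code; the pooling choice, head shapes, and equal loss weighting are assumptions.

```python
# Illustrative dual-objective matcher in the spirit of JointBERT: one BERT
# encoder, a binary match/non-match head, and a multi-class head that predicts
# the entity identifier of each description in the pair. Reconstructed from
# the abstract; head shapes and the equal loss weighting are assumptions.
import torch.nn as nn
from transformers import BertModel

class DualObjectiveMatcher(nn.Module):
    def __init__(self, num_entities, model_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.match_head = nn.Linear(hidden, 2)               # match / non-match
        self.entity_head = nn.Linear(hidden, num_entities)   # entity identifier

    def forward(self, input_ids, attention_mask,
                match_labels, left_entity_ids, right_entity_ids):
        # The pair is encoded as one sequence: [CLS] left [SEP] right [SEP].
        pooled = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).pooler_output
        ce = nn.CrossEntropyLoss()
        match_loss = ce(self.match_head(pooled), match_labels)
        # Predict the identifier of both descriptions from the pooled encoding
        # (a simplification; per-description pooling is another valid choice).
        entity_logits = self.entity_head(pooled)
        entity_loss = (ce(entity_logits, left_entity_ids)
                       + ce(entity_logits, right_entity_ids))
        return match_loss + entity_loss  # equal weighting: an assumption
```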

