An Automatic Blocking Keys Selection For Efficient Record Linkage

One of the important processes in the data quality field is record linkage (RL). RL (also known as entity resolution) is the process of detecting duplicates that refer to the same real-world entity in one or more datasets. The most critical step during the RL process is blocking, which reduces the quadratic complexity of the process by dividing the data into a set of blocks. By that way, matching is done only between the records in the same block. However, selecting the best blocking keys to divide the data is a hard task, and in most cases, it's done by a domain expert. In this paper, a novel unsupervised approach for an automatic blocking key selection is proposed. This approach is based on the recently proposed meta-heuristic bald eagles search (bes) optimization algorithm, where the problem is treated as a feature selection case. The obtained results from experiments on real-world datasets showed the efficiency of the proposition where the BES for feature selection outperformed existed approaches in the literature and returned the best blocking keys.

Download Full-text

Dimension Reduction for Objects Composed of Vector Sets

International Journal of Applied Mathematics and Computer Science ◽

10.1515/amcs-2017-0012 ◽

2017 ◽

Vol 27 (1) ◽

pp. 169-180 ◽

Cited By ~ 1

Author(s):

Marton Szemenyei ◽

Ferenc Vajda

Keyword(s):

Machine Learning ◽

Data Mining ◽

Feature Selection ◽

Discriminant Analysis ◽

Probability Distribution ◽

Dimension Reduction ◽

Pose Estimation ◽

Real World ◽

Single Object ◽

Real World Datasets

Abstract Dimension reduction and feature selection are fundamental tools for machine learning and data mining. Most existing methods, however, assume that objects are represented by a single vectorial descriptor. In reality, some description methods assign unordered sets or graphs of vectors to a single object, where each vector is assumed to have the same number of dimensions, but is drawn from a different probability distribution. Moreover, some applications (such as pose estimation) may require the recognition of individual vectors (nodes) of an object. In such cases it is essential that the nodes within a single object remain distinguishable after dimension reduction. In this paper we propose new discriminant analysis methods that are able to satisfy two criteria at the same time: separating between classes and between the nodes of an object instance. We analyze and evaluate our methods on several different synthetic and real-world datasets.

Download Full-text

Zero-Shot Feature Selection via Transferring Supervised Knowledge

International Journal of Data Warehousing and Mining ◽

10.4018/ijdwm.2021040101 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-20

Author(s):

Zheng Wang ◽

Qiao Wang ◽

Tingzhang Zhao ◽

Chaokun Wang ◽

Xiaojun Ye

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Dimensionality Reduction ◽

Real World ◽

Rapid Growth ◽

Learning Systems ◽

Training Data ◽

Effective Technique ◽

Supervised Methods ◽

Real World Datasets

Feature selection, an effective technique for dimensionality reduction, plays an important role in many machine learning systems. Supervised knowledge can significantly improve the performance. However, faced with the rapid growth of newly emerging concepts, existing supervised methods might easily suffer from the scarcity and validity of labeled data for training. In this paper, the authors study the problem of zero-shot feature selection (i.e., building a feature selection model that generalizes well to “unseen” concepts with limited training data of “seen” concepts). Specifically, they adopt class-semantic descriptions (i.e., attributes) as supervision for feature selection, so as to utilize the supervised knowledge transferred from the seen concepts. For more reliable discriminative features, they further propose the center-characteristic loss which encourages the selected features to capture the central characteristics of seen concepts. Extensive experiments conducted on various real-world datasets demonstrate the effectiveness of the method.

Download Full-text

Multiobjective feature selection for key quality characteristic identification in production processes using a nondominated-sorting-based whale optimization algorithm

Computers & Industrial Engineering ◽

10.1016/j.cie.2020.106852 ◽

2020 ◽

Vol 149 ◽

pp. 106852

Author(s):

An-Da Li ◽

Zhen He

Keyword(s):

Feature Selection ◽

Optimization Algorithm ◽

Whale Optimization Algorithm ◽

Quality Characteristic ◽

Production Processes ◽

Whale Optimization ◽

Selection For ◽

Nondominated Sorting

Download Full-text

Optimal feature selection for speech emotion recognition using enhanced cat swarm optimization algorithm

International Journal of Speech Technology ◽

10.1007/s10772-020-09776-x ◽

2020 ◽

Author(s):

M. Gomathy

Keyword(s):

Feature Selection ◽

Emotion Recognition ◽

Optimization Algorithm ◽

Speech Emotion Recognition ◽

Swarm Optimization ◽

Cat Swarm Optimization ◽

Optimal Feature Selection ◽

Selection For ◽

Optimal Feature

Download Full-text

Feature Selection for a Real-World Learning Task

Machine Learning and Data Mining in Pattern Recognition - Lecture Notes in Computer Science ◽

10.1007/3-540-44596-x_13 ◽

2001 ◽

pp. 157-172 ◽

Cited By ~ 2

Author(s):

D. Kollmar ◽

D.H. Hellmann

Keyword(s):

Feature Selection ◽

Real World ◽

Learning Task ◽

Real World Learning ◽

Selection For

Download Full-text

Privacy-preserving record linkage on large real world datasets

Journal of Biomedical Informatics ◽

10.1016/j.jbi.2013.12.003 ◽

2014 ◽

Vol 50 ◽

pp. 205-212 ◽

Cited By ~ 40

Author(s):

Sean M. Randall ◽

Anna M. Ferrante ◽

James H. Boyd ◽

Jacqueline K. Bauer ◽

James B. Semmens

Keyword(s):

Real World ◽

Record Linkage ◽

Privacy Preserving ◽

Real World Datasets

Download Full-text

Feature Selection for Predicting Heart Disease Using Black Hole Optimization Algorithm and XGBoost Classifier

2021 International Conference on Computer Communication and Informatics (ICCCI) ◽

10.1109/iccci50826.2021.9402511 ◽

2021 ◽

Author(s):

R. Rajadevi ◽

E.M.Roopa Devi ◽

R. Shanthakumari ◽

R.S. Latha ◽

N. Anitha ◽

...

Keyword(s):

Black Hole ◽

Feature Selection ◽

Heart Disease ◽

Optimization Algorithm ◽

Selection For

Download Full-text

Research on feature selection for rotating machinery based on Supervision Kernel Entropy Component Analysis with Whale Optimization Algorithm

Applied Soft Computing ◽

10.1016/j.asoc.2020.106245 ◽

2020 ◽

Vol 92 ◽

pp. 106245

Author(s):

Lili Bai ◽

Zhennan Han ◽

Jiajun Ren ◽

Xiaofeng Qin

Keyword(s):

Feature Selection ◽

Optimization Algorithm ◽

Component Analysis ◽

Rotating Machinery ◽

Whale Optimization Algorithm ◽

Whale Optimization ◽

Selection For ◽

Entropy Component

Download Full-text

Online Streaming Features Selection via Markov Blanket

Symmetry ◽

10.3390/sym14010149 ◽

2022 ◽

Vol 14 (1) ◽

pp. 149

Author(s):

Waqar Khan ◽

Lingfu Kong ◽

Brekhna Brekhna ◽

Ling Wang ◽

Huigui Yan

Keyword(s):

Feature Selection ◽

Real World ◽

Conditional Independence ◽

Prediction Accuracy ◽

Features Selection ◽

Sample Sizes ◽

Markov Blanket ◽

Independence Tests ◽

Online Feature Selection ◽

Real World Datasets

Streaming feature selection has always been an excellent method for selecting the relevant subset of features from high-dimensional data and overcoming learning complexity. However, little attention is paid to online feature selection through the Markov Blanket (MB). Several studies based on traditional MB learning presented low prediction accuracy and used fewer datasets as the number of conditional independence tests is high and consumes more time. This paper presents a novel algorithm called Online Feature Selection Via Markov Blanket (OFSVMB) based on a statistical conditional independence test offering high accuracy and less computation time. It reduces the number of conditional independence tests and incorporates the online relevance and redundant analysis to check the relevancy between the upcoming feature and target variable T, discard the redundant features from Parents-Child (PC) and Spouses (SP) online, and find PC and SP simultaneously. The performance OFSVMB is compared with traditional MB learning algorithms including IAMB, STMB, HITON-MB, BAMB, and EEMB, and Streaming feature selection algorithms including OSFS, Alpha-investing, and SAOLA on 9 benchmark Bayesian Network (BN) datasets and 14 real-world datasets. For the performance evaluation, F1, precision, and recall measures are used with a significant level of 0.01 and 0.05 on benchmark BN and real-world datasets, including 12 classifiers keeping a significant level of 0.01. On benchmark BN datasets with 500 and 5000 sample sizes, OFSVMB achieved significant accuracy than IAMB, STMB, HITON-MB, BAMB, and EEMB in terms of F1, precision, recall, and running faster. It finds more accurate MB regardless of the size of the features set. In contrast, OFSVMB offers substantial improvements based on mean prediction accuracy regarding 12 classifiers with small and large sample sizes on real-world datasets than OSFS, Alpha-investing, and SAOLA but slower than OSFS, Alpha-investing, and SAOLA because these algorithms only find the PC set but not SP. Furthermore, the sensitivity analysis shows that OFSVMB is more accurate in selecting the optimal features.

Download Full-text

Comparing Methods for Record Linkage for Public Health Action: Matching Algorithm Validation Study (Preprint)

10.2196/preprints.15917 ◽

2019 ◽

Author(s):

Tigran Avoundjian ◽

Julia C Dombrowski ◽

Matthew R Golden ◽

James P Hughes ◽

Brandon L Guthrie ◽

...

Keyword(s):

Public Health ◽

Data Quality ◽

Real World ◽

Record Linkage ◽

Computation Time ◽

Health Departments ◽

Public Health Action ◽

Deterministic Algorithms ◽

Health Action ◽

Average Computation Time

BACKGROUND Many public health departments use record linkage between surveillance data and external data sources to inform public health interventions. However, little guidance is available to inform these activities, and many health departments rely on deterministic algorithms that may miss many true matches. In the context of public health action, these missed matches lead to missed opportunities to deliver interventions and may exacerbate existing health inequities. OBJECTIVE This study aimed to compare the performance of record linkage algorithms commonly used in public health practice. METHODS We compared five deterministic (exact, Stenger, Ocampo 1, Ocampo 2, and Bosh) and two probabilistic record linkage algorithms (fastLink and beta record linkage [BRL]) using simulations and a real-world scenario. We simulated pairs of datasets with varying numbers of errors per record and the number of matching records between the two datasets (ie, overlap). We matched the datasets using each algorithm and calculated their recall (ie, sensitivity, the proportion of true matches identified by the algorithm) and precision (ie, positive predictive value, the proportion of matches identified by the algorithm that were true matches). We estimated the average computation time by performing a match with each algorithm 20 times while varying the size of the datasets being matched. In a real-world scenario, HIV and sexually transmitted disease surveillance data from King County, Washington, were matched to identify people living with HIV who had a syphilis diagnosis in 2017. We calculated the recall and precision of each algorithm compared with a composite standard based on the agreement in matching decisions across all the algorithms and manual review. RESULTS In simulations, BRL and fastLink maintained a high recall at nearly all data quality levels, while being comparable with deterministic algorithms in terms of precision. Deterministic algorithms typically failed to identify matches in scenarios with low data quality. All the deterministic algorithms had a shorter average computation time than the probabilistic algorithms. BRL had the slowest overall computation time (14 min when both datasets contained 2000 records). In the real-world scenario, BRL had the lowest trade-off between recall (309/309, 100.0%) and precision (309/312, 99.0%). CONCLUSIONS Probabilistic record linkage algorithms maximize the number of true matches identified, reducing gaps in the coverage of interventions and maximizing the reach of public health action.

Download Full-text