Accurate privacy-preserving record linkage for databases with missing values

Introduction: Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications.

Objectives and Approach: Binary encodings, such as Bloom filters, are popular for linking sensitive databases and are now employed in real-world linkage applications. However, existing encoding techniques assume that the quasi-identifying attributes used for encoding are complete. Missing values can lead to incomplete encodings, which can decrease or increase the similarities between encodings and therefore lead to false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases, from which pairs of records that correspond to the same voter (with name or address changes) were drawn to form files of 100,000 and 500,000 records containing from 0% to 50% missing values.

Results: We encoded between two and four of the attributes first name, last name, street, and city into three record-level binary encodings: Cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed a 10% to 25% drop on average in both precision and recall for all encoding techniques as the proportion of missing values increased. CLK showed the largest decrease in precision, while TBH showed the largest decrease in recall compared to the other encoding techniques.

Conclusion: Binary encodings such as Bloom filters are now used in practical applications for linking sensitive databases. Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.
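The effect described in this abstract, where a missing attribute changes the similarity between two encodings, can be illustrated with a small sketch. The code below is a simplified CLK-style encoding, not the implementation evaluated in the study: the filter length (1000), number of hash functions (10), use of bigrams, the HMAC-based double hashing, the key `demo-key`, and the sample records are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Return the set of overlapping q-grams of a lower-cased, padded string."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def clk_encode(values, length=1000, num_hashes=10, secret=b"demo-key"):
    """Hash the q-grams of all attribute values into one set of bit positions.

    Missing values (None or empty string) are skipped, so they set no bits.
    All parameters are illustrative assumptions, not the study's settings.
    """
    bits = set()
    for value in values:
        if not value:
            continue
        for gram in qgrams(value):
            msg = gram.encode("utf-8")
            h1 = int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest(), "big")
            h2 = int.from_bytes(hmac.new(secret, msg, hashlib.sha512).digest(), "big")
            # Double hashing: the i-th position is (h1 + i * h2) mod length.
            for i in range(num_hashes):
                bits.add((h1 + i * h2) % length)
    return bits


def dice(a, b):
    """Dice coefficient of two sets of 1-bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical records: first name, last name, street, city.
rec_a = ["anna", "smith", "main street", "durham"]
rec_b = ["anna", "smyth", "main street", "durham"]
rec_b_missing = ["anna", "smyth", None, "durham"]

print("complete:      ", dice(clk_encode(rec_a), clk_encode(rec_b)))
print("street missing:", dice(clk_encode(rec_a), clk_encode(rec_b_missing)))
```

In this sketch a missing street removes bits that both records would otherwise share, which lowers the Dice similarity of a true match. The opposite effect can also occur: two different people who agree on name but differ on address become indistinguishable if the address attributes are missing in both records.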

Securing Bloom Filters for Privacy-preserving Record Linkage

Proceedings of the 29th ACM International Conference on Information & Knowledge Management ◽

10.1145/3340531.3412105 ◽

2020 ◽

Author(s):

Thilina Ranbaduge ◽

Rainer Schnell

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

Bloom Filters

A framework for consensual and online privacy preserving record linkage in real-time

2015 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2015.7364057 ◽

2015 ◽

Cited By ~ 1

Author(s):

Daniel Muller ◽

Stefan Mau ◽

Irena Pletikosa Cvijikj

Keyword(s):

Real Time ◽

Record Linkage ◽

Privacy Preserving ◽

Online Privacy

Towards Privacy-Preserving Record Linkage with Record-Wise Linkage Policy

Lecture Notes in Computer Science - Database and Expert Systems Applications ◽

10.1007/978-3-319-64468-4_18 ◽

2017 ◽

pp. 233-248

Author(s):

Takahito Kaiho ◽

Wen-jie Lu ◽

Toshiyuki Amagasa ◽

Jun Sakuma

Keyword(s):

Record Linkage ◽

Privacy Preserving

Encoding Hierarchical Classification Codes for Privacy-Preserving Record Linkage Using Bloom Filters

Machine Learning and Knowledge Discovery in Databases - Communications in Computer and Information Science ◽

10.1007/978-3-030-43887-6_12 ◽

2020 ◽

pp. 142-156

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Hierarchical Classification ◽

Privacy Preserving ◽

Bloom Filters

Accuracy of probabilistic and deterministic record linkage: the case of tuberculosis

Revista de Saúde Pública ◽

10.1590/s1518-8787.2016050006327 ◽

2016 ◽

Vol 50 (0) ◽

Cited By ~ 10

Author(s):

Gisele Pinto de Oliveira ◽

Ana Luiza de Souza Bierrenbach ◽

Kenneth Rochel de Camargo Júnior ◽

Cláudia Medina Coeli ◽

Rejane Sobrino Pinheiro

Keyword(s):

Record Linkage ◽

Missing Values ◽

Probabilistic Approach ◽

Roc Curves ◽

Cutoff Point ◽

Accurate Analysis ◽

Linkage Algorithm ◽

Probabilistic Linkage ◽

Key Variables ◽

High Level

ABSTRACT. OBJECTIVE: To analyze the accuracy of deterministic and probabilistic record linkage in identifying duplicate TB records, as well as the characteristics of discordant pairs. METHODS: The study analyzed all TB records from 2009 to 2011 in the state of Rio de Janeiro. A deterministic record linkage algorithm was developed using a set of 70 rules, based on combinations of fragments of the key variables with or without modification (Soundex or substring). Each rule was formed by three or more fragments. The probabilistic approach required a cutoff point for the score, above which links would be automatically classified as belonging to the same individual. The cutoff point was obtained by linking the Notifiable Diseases Information System – Tuberculosis database with itself, followed by manual review and analysis of ROC and precision-recall curves. Sensitivity and specificity were calculated to assess accuracy. RESULTS: Sensitivity ranged from 87.2% to 95.2% and specificity from 99.8% to 99.9% for probabilistic and deterministic record linkage, respectively. Missing values in the key variables and low similarity scores for name and date of birth were mainly responsible for the failure to identify records of the same individual with the techniques used. CONCLUSIONS: The two techniques showed a high level of correlation for pair classification. Although deterministic linkage identified more duplicate records than probabilistic linkage, the latter retrieved records not identified by the former. User need and experience should be considered when choosing the best technique to be used.
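As a rough illustration of the deterministic approach and the accuracy measures mentioned in this abstract, the sketch below applies one substring-based rule built from three fragments and computes sensitivity and specificity against known match labels. The field names (`name`, `birth_date`, `mother_name`), the fragment lengths, and the sample records are hypothetical; the study's 70 rules, including its Soundex-based fragments, are not reproduced here.

```python
def fragment_key(record):
    """Build one hypothetical rule key from three substring fragments.

    Returns None if any fragment is missing, so the rule cannot fire.
    """
    fragments = (
        (record.get("name") or "")[:4].lower(),
        (record.get("birth_date") or "")[:4],       # year of birth
        (record.get("mother_name") or "")[:3].lower(),
    )
    return fragments if all(fragments) else None


def deterministic_match(a, b):
    """Two records match if both keys are complete and identical."""
    key_a, key_b = fragment_key(a), fragment_key(b)
    return key_a is not None and key_a == key_b


def sensitivity_specificity(predicted, actual):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical records; r3 is the same person as r1 with a missing birth date.
r1 = {"name": "Maria Silva", "birth_date": "1980-05-12", "mother_name": "Ana Silva"}
r2 = {"name": "Maria Sylva", "birth_date": "1980-05-12", "mother_name": "Ana Silva"}
r3 = {"name": "Maria Silva", "birth_date": "", "mother_name": "Ana Silva"}
r4 = {"name": "Joao Souza", "birth_date": "1975-01-30", "mother_name": "Rita Souza"}

pairs = [(r1, r2, True), (r1, r3, True), (r1, r4, False), (r2, r4, False)]
predicted = [deterministic_match(a, b) for a, b, _ in pairs]
actual = [same for _, _, same in pairs]
print(sensitivity_specificity(predicted, actual))
```

In this toy data the pair with the missing birth date is a true duplicate that the rule cannot identify, which mirrors the abstract's observation that missing values in key variables were a main cause of missed matches.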
