Evaluation of approximate comparison methods on Bloom filters for probabilistic linkage

Introduction The need for increased privacy protection in data linkage has driven the development of privacy-preserving record linkage (PPRL) techniques. A popular technique using Bloom filters, with cryptographic analyses, modifications, and hashing variations to optimise privacy, has been the focus of much research in this area. With few applications of Bloom filters within a probabilistic framework, there is limited information on whether approximate matches between Bloom filtered fields can improve linkage quality. Objectives In this study, we evaluate the effectiveness of three approximate comparison methods for Bloom filters within the context of the Fellegi-Sunter model of record linkage: Sørensen–Dice coefficient, Jaccard similarity and Hamming distance. Methods Using synthetic datasets with introduced errors to simulate datasets with a range of data quality and a large real-world administrative health dataset, the research estimated partial weight curves for converting similarity scores (for each approximate comparison method) to partial weights at both field and dataset level. Deduplication linkages were run on each dataset using these partial weight curves. The quality of these linkages was then compared with that of linkages using simple cut-off similarity values and linkages using only exact matching. Results Linkages using approximate comparisons produced significantly better quality results than those using exact comparisons only. Field level partial weight curves for a specific dataset produced the best quality results. The Sørensen–Dice coefficient and Jaccard similarity produced the most consistent results across a spectrum of synthetic and real-world datasets. Conclusion The use of Bloom filter similarity comparisons for probabilistic record linkage can produce linkage quality results which are comparable to Jaro-Winkler string similarities with unencrypted linkages.
Probabilistic linkages using Bloom filters benefit significantly from the use of similarity comparisons, with partial weight curves producing the best results, even when not optimised for that particular dataset.
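
The three comparison methods named in this abstract can be illustrated on small bit vectors. The following is a minimal sketch and not the authors' implementation: it assumes each Bloom filter is held as a Python integer whose binary digits are the filter bits, and it reports Hamming distance as a normalised similarity (1 minus the fraction of differing bits), which is an assumption made here so that all three measures share a 0 to 1 scale. The function names are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def dice(a: int, b: int) -> float:
    """Sørensen-Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    total = popcount(a) + popcount(b)
    if total == 0:
        return 1.0  # two empty filters are treated as identical (assumption)
    return 2 * popcount(a & b) / total


def jaccard(a: int, b: int) -> float:
    """Jaccard similarity: |A and B| / |A or B|."""
    union = popcount(a | b)
    if union == 0:
        return 1.0  # two empty filters are treated as identical (assumption)
    return popcount(a & b) / union


def hamming_similarity(a: int, b: int, length: int) -> float:
    """1 minus the fraction of the `length` bit positions that differ."""
    return 1 - popcount(a ^ b) / length


# Two illustrative 8-bit filters sharing three of their four set bits.
f1 = 0b10110100
f2 = 0b10010110
print(dice(f1, f2), jaccard(f1, f2), hamming_similarity(f1, f2, 8))
```

Unlike the other two measures, the Hamming-based score depends on the filter length, because positions where both filters are 0 count as agreement.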

Partial Agreements in Probabilistic Linkages

International Journal for Population Data Science ◽

10.23889/ijpds.v3i4.884 ◽

2018 ◽

Vol 3 (4) ◽

Cited By ~ 1

Author(s):

Adrian Brown ◽

Sean Randall ◽

Anna Ferrante ◽

James Boyd

Keyword(s):

Record Linkage ◽

Hamming Distance ◽

Synthetic Data ◽

Privacy Preserving ◽

Weight Functions ◽

Dice Coefficient ◽

Jaccard Similarity ◽

Value Similarity ◽

Partial Agreement ◽

Linkage Quality

Introduction Record linkage units around the world use probabilistic linkage techniques for routine linkage of large datasets. It is widely known how probabilities are converted to agreement and disagreement weights for each field, yet there has been little exploration of the methodology to optimally convert field similarity scores into partial weights. Objectives and Approach String similarity comparators such as Jaro-Winkler are commonly used in traditional linkage, while other comparators such as the Sørensen–Dice coefficient, Jaccard similarity and Hamming distance are used in alternative privacy-preserving record linkage techniques. Determining partial weights to apply at each level of similarity is a non-trivial task. However, both types of linkage would greatly benefit from a similarity-to-weight function for each field that maximises the accuracy of the linkage. We evaluated several methods for computing partial agreement weights and applied these to synthetic datasets with varying levels of corruption. We then evaluated the methods on real administrative datasets. Results Exact comparisons can miss matches where typographical errors or misspellings produce small changes in value. Similarity comparisons can reduce the number of missed matches, but may also increase the number of incorrect matches. Various results of the partial agreement methods on Jaro-Winkler, Sørensen–Dice coefficient, Jaccard similarity and Hamming distance comparators will be presented. A generic function to convert similarity values to weights, created from synthetic data, can be used on most datasets with a greatly improved result in linkage quality. However, maximising the linkage quality requires the creation of similarity-to-weight functions that are optimised for each dataset. Conclusion/Implications Accuracy in record linkage is vital for the correct analysis of linked data. It is even more critical in privacy-preserving record linkage where the ability for clerical review is limited.
Optimised functions for converting similarities to partial weights can significantly improve the quality of linkage and should not be overlooked.
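
To make the idea of a similarity-to-weight function concrete, here is a minimal sketch. The agreement and disagreement weights follow the standard Fellegi-Sunter definitions, log2(m/u) and log2((1-m)/(1-u)). The `partial_weight` function is purely an illustrative assumption: it interpolates linearly between the two weights above a hypothetical cut-off, whereas the abstract describes functions estimated from data, whose shape is not given here.

```python
import math


def agreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter agreement weight, log2(m / u)."""
    return math.log2(m / u)


def disagreement_weight(m: float, u: float) -> float:
    """Fellegi-Sunter disagreement weight, log2((1 - m) / (1 - u))."""
    return math.log2((1 - m) / (1 - u))


def partial_weight(similarity: float, m: float, u: float,
                   cutoff: float = 0.8) -> float:
    """Illustrative piecewise-linear similarity-to-weight function.

    At or below `cutoff` the full disagreement weight applies; above it
    the weight rises linearly to the full agreement weight at 1.0.
    The cut-off value and the linear shape are assumptions.
    """
    agree = agreement_weight(m, u)
    disagree = disagreement_weight(m, u)
    if similarity <= cutoff:
        return disagree
    fraction = (similarity - cutoff) / (1 - cutoff)
    return disagree + fraction * (agree - disagree)


# Hypothetical m and u probabilities for a single field.
for s in (0.5, 0.9, 1.0):
    print(s, round(partial_weight(s, m=0.9, u=0.1), 3))
```

A dataset-specific function would replace the linear segment with a curve fitted to observed similarities among matches and non-matches.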

Real world performance of privacy preserving record linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v3i4.990 ◽

2018 ◽

Vol 3 (4) ◽

Cited By ~ 2

Author(s):

Katie Irvine ◽

Michael Smith ◽

Reinier De Vos ◽

Adrian Brown ◽

Anna Ferrante ◽

...

Keyword(s):

Real World ◽

Record Linkage ◽

Gold Standard ◽

Secondary Care ◽

Personal Information ◽

Quality Metrics ◽

Bloom Filters ◽

Mortality Data ◽

Probabilistic Linkage ◽

Linkage Quality

Introduction Privacy preserving record linkage (PPRL) using encoded or hashed data has potential to enable large-scale record linkage of previously inaccessible data. With limited real-world evaluation and implementation of PPRL at scale it is challenging for linkage practitioners to judiciously balance data protection with the accuracy and usability of linked datasets. Objectives and Approach We evaluated the performance of PPRL techniques using Bloom filters for linkage of data across primary and secondary care settings. This technique limits the need to disclose personal information for linkage activities. Primary care data included 272,202 records from 16 general practices in NSW. This was linked to 42.8 million records from a 7 year series of emergency presentations, hospitalisations and death registrations. For the purpose of evaluation, personal information was encoded within the data linkage centre. The quality of PPRL linkage was assessed against the true match status based on a gold standard probabilistic linkage using full personal identifiers. Results Compared to the gold standard probabilistic linkage using full personal identifiers, the PPRL techniques produced quality metrics of precision, recall and F measure in excess of 0.90. When configured to leverage pre-existing links between emergency department, hospital and mortality data, quality metrics around 0.98-0.99 were achieved. Lower rates of linkage quality were associated with missing demographic information and some residual variation in linkage quality across practices was observed. Conclusion/Implications PPRL using Bloom filters is a promising technique for achieving high quality linkage across primary and secondary care in Australia. Further evaluation will assess scalability and quality in Australia but international collaborations are encouraged to more rapidly develop the evidence base and tactical approaches to support real world implementations.
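
The quality metrics reported in this abstract can be computed by comparing the record pairs produced by one linkage with those of a gold standard. The sketch below assumes each linkage is represented as a set of record-identifier pairs; the function name and the toy identifiers are hypothetical and do not come from the study.

```python
def linkage_quality(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, F-measure) of predicted pairs vs gold pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: three of the four predicted pairs are in the gold standard,
# and three of the four gold pairs were found.
gold_pairs = {(1, 2), (3, 4), (5, 6), (7, 8)}
pprl_pairs = {(1, 2), (3, 4), (5, 6), (9, 10)}
print(linkage_quality(pprl_pairs, gold_pairs))
```

Pairs must be stored in a consistent order (for example, smaller identifier first) so that the same link is not counted as two different pairs.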

Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.106 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Thilina Ranbaduge ◽

Dinusha Vatsalan ◽

Sean Randall ◽

Peter Christen

Keyword(s):

Real World ◽

Record Linkage ◽

State Of The Art ◽

Privacy Preserving ◽

World Health ◽

Disclosure Risk ◽

Linkage Quality ◽

Private Matching ◽

Number Of Parties ◽

Matching Techniques

ABSTRACT Objective The linking of multiple (three or more) health databases is challenging because of the increasing sizes of databases, the number of parties among which they are to be linked, and privacy concerns related to the use of personal data such as names, addresses, or dates of birth. This entails a need to develop advanced scalable techniques for linking multiple databases while preserving the privacy of the individuals they contain. In this study we empirically evaluate several state-of-the-art multi-party privacy-preserving record linkage (MP-PPRL) techniques with large real-world health databases from Australia. Approach MP-PPRL is conducted such that no sensitive information is revealed about database records that can be used to infer knowledge about individuals or groups of individuals. Current state-of-the-art methods used in this evaluation use Bloom filters to encode personal identifying information. The empirical evaluation comprises different multi-party private blocking and matching techniques that are evaluated for different numbers of parties. Each database contains more than 700,000 records extracted from ten years of New South Wales (NSW) emergency presentation data. Each technique is evaluated with regard to scalability, quality and privacy. Scalability and quality are measured using the metrics of reduction ratio, pairs completeness, precision, recall, and F-measure. Privacy is measured using disclosure risk metrics that are based on the probability of suspicion, defined as the likelihood that a record in an encoded database matches to one or more record(s) in a publicly available database such as a telephone directory. Both MP-PPRL techniques that utilize a trusted linkage unit and those that do not are evaluated. Results Experimental results showed MP-PPRL methods are practical for linking large-scale real world data.
Private blocking techniques achieved significantly higher privacy than standard hashing-based techniques with a maximum disclosure risk of 0.0003 and 1, respectively, at a small cost to linkage quality and efficiency. Similarly, private matching techniques incurred an acceptable reduction in linkage quality compared to standard non-private matching while providing high privacy protection. Conclusion The adoption of privacy-preserving linkage methods has the ability to significantly reduce privacy risks associated with linking large health databases, and enable the data linkage community to offer operational linkage services not previously possible. The evaluation results show that these state-of-the-art MP-PPRL techniques are scalable in terms of database sizes and number of parties, while providing significantly improved privacy with an associated trade-off in linkage quality compared to standard linkage techniques.
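
Two of the blocking metrics named in this abstract, reduction ratio and pairs completeness, have simple definitions. The sketch below uses the common two-database forms as an assumption: reduction ratio is 1 minus the number of candidate pairs divided by the number of all possible pairs, and pairs completeness is the share of true matching pairs that survive blocking. The multi-party variants used in the study may be defined differently, and the function names are hypothetical.

```python
def reduction_ratio(num_candidate_pairs: int, size_a: int, size_b: int) -> float:
    """Fraction of the size_a * size_b possible pairs removed by blocking."""
    return 1 - num_candidate_pairs / (size_a * size_b)


def pairs_completeness(candidate_pairs: set, true_matches: set) -> float:
    """Fraction of true matching pairs retained among the candidate pairs."""
    if not true_matches:
        return 1.0  # nothing to find, so nothing was lost (assumption)
    return len(candidate_pairs & true_matches) / len(true_matches)


# Toy example: blocking keeps 1,000 of 1,000,000 possible pairs,
# and two of four true matches remain among the candidates.
print(reduction_ratio(1000, 1000, 1000))
print(pairs_completeness({(1, 1), (2, 2), (3, 5)},
                         {(1, 1), (2, 2), (3, 3), (4, 4)}))
```

The two metrics pull against each other: more aggressive blocking raises the reduction ratio but tends to lower pairs completeness.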

Optimization of the Mainzelliste software for fast privacy-preserving record linkage

Journal of Translational Medicine ◽

10.1186/s12967-020-02678-1 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Florens Rohde ◽

Martin Franke ◽

Ziad Sehili ◽

Martin Lablans ◽

Erhard Rahm

Keyword(s):

Record Linkage ◽

Comprehensive Evaluation ◽

Personal Data ◽

Privacy Preserving ◽

Use Cases ◽

Locality Sensitive Hashing ◽

Third Party ◽

Bloom Filters ◽

Linkage Quality ◽

Order Of Magnitude

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. 
We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The blocking methods improve runtime by orders of magnitude, thus facilitating use in research projects with large datasets and many participants.
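
The general idea behind LSH-based blocking on Bloom filters can be sketched as follows. This is not the Mainzelliste implementation: it assumes a simple Hamming-style LSH in which each blocking key is the tuple of bits found at a fixed, randomly sampled set of positions, and only records that share at least one key become candidate pairs. All names and parameter values are hypothetical.

```python
import random
from collections import defaultdict


def make_key_positions(length: int, num_keys: int, bits_per_key: int, seed: int = 0):
    """Sample, for each blocking key, the bit positions it reads."""
    rng = random.Random(seed)
    return [rng.sample(range(length), bits_per_key) for _ in range(num_keys)]


def blocking_keys(bloom, positions):
    """One (key index, sampled bits) tuple per blocking key."""
    return [(i, tuple(bloom[p] for p in pos)) for i, pos in enumerate(positions)]


def candidate_pairs(records: dict, positions) -> set:
    """Pairs of record ids that share at least one blocking key."""
    blocks = defaultdict(list)
    for record_id, bloom in records.items():
        for key in blocking_keys(bloom, positions):
            blocks[key].append(record_id)
    pairs = set()
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


# Toy 8-bit filters: "a" and "b" are identical, "c" differs in every bit.
records = {
    "a": [1, 0, 1, 1, 0, 0, 1, 0],
    "b": [1, 0, 1, 1, 0, 0, 1, 0],
    "c": [0, 1, 0, 0, 1, 1, 0, 1],
}
positions = make_key_positions(length=8, num_keys=3, bits_per_key=4)
print(candidate_pairs(records, positions))
```

Similar filters agree on most sampled positions and so are likely to share a key, while comparisons between dissimilar records are skipped, which is where the runtime saving comes from.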

A blinded evaluation of privacy preserving record linkage with Bloom filters

BMC Medical Research Methodology ◽

10.1186/s12874-022-01510-2 ◽

2022 ◽

Vol 22 (1) ◽

Author(s):

Sean Randall ◽

Helen Wichmann ◽

Adrian Brown ◽

James Boyd ◽

Tom Eitelhuber ◽

...

Keyword(s):

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Filter Method ◽

Morbidity Data ◽

Linkage Quality ◽

Hospital Morbidity ◽

Western Australian ◽

Direct Investigation

Abstract Background Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However real-world evaluations are required to confirm their suitability in practice. Methods An extract of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters, and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.

Overcoming the Impasse 2: Assessing the Quality of Recent Australian Applications of a Privacy-Preserving Record Linkage Method (PPRL-BLOOM)

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1489 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Sean Randall ◽

Adrian Brown ◽

Anna Ferrante ◽

James Boyd ◽

Katie Irvine ◽

...

Keyword(s):

Real World ◽

Child Protection ◽

Record Linkage ◽

Data Linkage ◽

Privacy Preserving ◽

Third Party ◽

Regulatory Constraints ◽

Linkage Quality ◽

Personally Identifying Information

Introduction While the quantity and type of datasets used by data linkage projects is growing, there remain some datasets that are ‘not available’ or ‘hard to access’ by researchers and linkers, either due to legal/regulatory constraints restricting the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and Approach We present and describe several real-world applications of privacy preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data was linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. Results The privacy preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / Implications Privacy preserving linkage is a useful and innovative methodology that is currently being used in real world projects.
The results of these evaluations suggest it can be an appropriate linkage tool when legal or other constraints block release of personally identifying information to third party linkage units.

Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1445 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Thilina Ranbaduge ◽

Peter Christen

Keyword(s):

National Security ◽

Record Linkage ◽

Missing Values ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Practical Applications ◽

Binary Encoding ◽

Linkage Quality

Introduction Applications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications. Objectives and Approach Binary encodings, such as Bloom filters, are popular for linking sensitive databases. They are now employed in real-world linkage applications. However, existing encoding techniques assume the quasi-identifying attributes used for encoding to be complete. Missing values can lead to incomplete encodings which can result in decreased or increased similarities and therefore to false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases, where pairs of records that correspond to the same voter (with name or address changes) resulted in files of 100,000 and 500,000 records containing from 0% to 50% missing values. Results We encoded between two and four of the attributes first and last name, street, and city into three record-level binary encodings: Cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed a 10% to 25% drop on average in both precision and recall for all encoding techniques when missing values are increasing. CLK resulted in the highest decrease in precision, while TBH resulted in the highest decrease in recall compared to the other encoding techniques. Conclusion Binary encodings such as Bloom filters are now used in practical applications for linking sensitive databases.
Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.
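
The effect of a missing value on a record-level encoding can be illustrated with a simplified sketch. This is written in the spirit of a record-level Bloom filter but is not the CLK, RBF or TBH scheme evaluated in the study: it assumes padded bigrams, a hypothetical filter length of 1000 bits, 10 keyed hashes per bigram built from HMAC-SHA256, and a made-up key. The filter is held as the set of bit positions that are set to 1.

```python
import hashlib
import hmac


def qgrams(value: str, q: int = 2) -> set:
    """Set of q-grams of the lower-cased value, padded with underscores."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def encode_record(fields, length: int = 1000, k: int = 10,
                  key: bytes = b"hypothetical-secret") -> set:
    """Hash the q-grams of every non-missing field into one set of bit positions."""
    bits = set()
    for value in fields:
        if not value:
            continue  # a missing value contributes no bits to the encoding
        for gram in qgrams(value):
            for i in range(k):
                digest = hmac.new(key, f"{i}:{gram}".encode(), hashlib.sha256).digest()
                bits.add(int.from_bytes(digest[:8], "big") % length)
    return bits


def dice(a: set, b: set) -> float:
    """Sørensen-Dice coefficient on sets of set-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# The same (fictional) person, once complete and once with the last name missing.
complete = encode_record(["anna", "smith"])
incomplete = encode_record(["anna", None])
print(dice(complete, complete), dice(complete, incomplete))
```

Because the incomplete record sets fewer bits, its similarity to the complete record of the same person drops below 1, which is one way a missing value can turn a true match into a false non-match.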

An Evaluation Framework for Privacy-Preserving Record Linkage

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v6i1.636 ◽

2014 ◽

Vol 6 (1) ◽

Cited By ~ 20

Author(s):

Dinusha Vatsalan ◽

Peter Christen ◽

Christine M. O'Keefe ◽

Vassilios S. Verykios

Keyword(s):

Real World ◽

Record Linkage ◽

Comparative Evaluation ◽

Privacy Preserving ◽

Evaluation Framework ◽

Sensitive Information ◽

The Past ◽

Large Databases ◽

Linkage Quality ◽

The Face

Privacy-preserving record linkage (PPRL) addresses the problem of identifying matching records from different databases that correspond to the same real-world entities using quasi-identifying attributes (in the absence of unique entity identifiers), while preserving privacy of these entities. Privacy is being preserved by not revealing any information that could be used to infer the actual values about the records that are not reconciled to the same entity (non-matches), and any confidential or sensitive information (that is not agreed upon by the data custodians) about the records that were reconciled to the same entity (matches) during or after the linkage process. The PPRL process often involves three main challenges, which are scalability to large databases, high linkage quality in the presence of data quality errors, and sufficient privacy guarantees. While many solutions have been developed for the PPRL problem over the past two decades, an evaluation and comparison framework of PPRL solutions with standard numerical measures defined for all three properties (scalability, linkage quality, and privacy) of PPRL has so far not been presented in the literature. We propose a general framework with normalized measures to practically evaluate and compare PPRL solutions in the face of linkage attack methods that are based on an external global dataset. We conducted experiments of several existing PPRL solutions on real-world databases using our proposed evaluation framework, and the results show that our framework provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.

Towards Auditable and Intelligent Privacy-Preserving Record Linkage

10.5753/sbbd_estendido.2021.18170 ◽

2021 ◽

Author(s):

Thiago Nóbrega ◽

Carlos Eduardo S. Pires ◽

Dimas Cassimiro Nascimento

Keyword(s):

Real World ◽

Record Linkage ◽

Personal Information ◽

Privacy Preserving ◽

Data Sources ◽

Machine Learning Techniques ◽

Sensitive Data ◽

General Data Protection Regulation ◽

Assistance Programs ◽

Linkage Quality

Privacy-Preserving Record Linkage (PPRL) intends to integrate private/sensitive data from several data sources held by different parties. It aims to identify records (e.g., persons or objects) representing the same real-world entity over private data sources held by different custodians. Due to recent laws and regulations (e.g., General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process needs to deal with both efficacy (linkage quality) and privacy problems. For instance, the PPRL process needs to be executed over data sources (e.g., a database containing personal information of governmental income distribution and assistance programs), with an accurate linkage of the entities, and, at the same time, protect the privacy of the information. Thus, this work intends to simplify the PPRL process by helping real-world applications (such as medical, epidemiologic, and populational studies) reduce the legal and bureaucratic effort needed to access and process the data, making these applications' execution more straightforward for companies and governments. In this context, this work presents two major contributions to PPRL: i) we improve linkage quality and simplify the process by employing Machine Learning techniques to decide whether or not two records represent the same entity; and ii) we enable the auditability of the computations performed during PPRL.

Encoding Diagnostic Codes for Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1461 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

New Method ◽

Bloom Filters ◽

Mortality Data ◽

Sensitive Information ◽

Diagnostic Code ◽

Relational Similarity ◽

Linkage Quality ◽

The Impact

Introduction Diagnostic codes, such as the ICD-10, may be considered as sensitive information. If such codes have to be encoded using current methods for data linkage, all hierarchical information given by the code positions will be lost. We present a technique (HPBFs) for preserving the hierarchical information of the codes while protecting privacy. The new method modifies a widely used Privacy-preserving Record Linkage (PPRL) technique based on Bloom filters for the use with hierarchical codes. Objectives and Approach Assessing the similarities of hierarchical codes requires considering the code positions of two codes in a given diagnostic hierarchy. The hierarchical similarities of the original diagnostic code pairs should correspond closely to the similarity of the encoded pairs of the same code. Furthermore, to assess the hierarchy-preserving properties of an encoding, the impact on similarity measures from differing code positions at all levels of the code hierarchy can be evaluated. A full match of codes should yield a higher similarity than partial matches. Finally, the new method is tested against ad-hoc solutions as an addition to a standard PPRL setup. This is done using real-world mortality data with a known link status of two databases. Results In all applications for encoded ICD codes where either categorical discrimination, relational similarity or linkage quality in a PPRL setting is required, HPBFs outperform other known methods. Lower mean differences and smaller confidence intervals between clear-text codes and encrypted code pairs were observed, indicating better preservation of hierarchical similarities. Finally, using these techniques allows for much better hierarchical discrimination for partial matches. Conclusion The new technique yields better linkage results than all other known methods to encrypt hierarchical codes.
In all tests, comparing categorical discrimination, relational similarity and PPRL linkage quality, HPBFs outperformed methods currently used.