Privacy preserving linkage using multiple dynamic match keys

IntroductionAvailable and practical methods for privacy preserving linkage have shortcomings: methods utilising anonymous linkage codes provide limited accuracy while methods based on Bloom filters have proven vulnerable to frequency-based attacks. ObjectivesIn this paper, we present and evaluate a novel protocol that aims to meld both the accuracy of the Bloom filter method with the privacy achievable through the anonymous linkage code methodology. MethodsThe protocol involves creating multiple match-keys for each record, with the composition of each match-key depending on attributes of the underlying datasets being compared. The protocol was evaluated through de-duplication of four administrative datasets and two synthetic datasets; the ‘answers’ outlining which records belonged to the same individual were known for each dataset. The results were compared against results achieved with un-encoded linkage and other privacy preserving techniques on the same datasets. ResultsThe multiple match-key protocol presented here achieved high quality across all datasets, performing better than record-level Bloom filters and the SLK, but worse than field-level Bloom filters. ConclusionThe presented method provides high linkage quality while avoiding the frequency based attacks that have been demonstrated against the Bloom filter approach. The method appears promising for real world use.

Download Full-text

A blinded evaluation of privacy preserving record linkage with Bloom filters

BMC Medical Research Methodology ◽

10.1186/s12874-022-01510-2 ◽

2022 ◽

Vol 22 (1) ◽

Author(s):

Sean Randall ◽

Helen Wichmann ◽

Adrian Brown ◽

James Boyd ◽

Tom Eitelhuber ◽

...

Keyword(s):

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Filter Method ◽

Morbidity Data ◽

Linkage Quality ◽

Hospital Morbidity ◽

Western Australian ◽

Direct Investigation

Abstract Background Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However real-world evaluations are required to confirm their suitability in practice. Methods An extract of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters, and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.

Download Full-text

Evaluating Binary Encoding Techniques in The Presence of Missing Values in Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1445 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Thilina Ranbaduge ◽

Peter Christen

Keyword(s):

National Security ◽

Record Linkage ◽

Missing Values ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Practical Applications ◽

Binary Encoding ◽

Linkage Quality

IntroductionApplications in domains ranging from healthcare to national security increasingly require records about individuals in sensitive databases to be linked in privacy-preserving ways. Missing values make the linkage process challenging because they can affect the encoding of attribute values. No study has systematically investigated how missing values affect the outcomes of different encoding techniques used in privacy-preserving linkage applications. Objectives and ApproachBinary encodings, such as Bloom filters, are popular for linking sensitive databases. They are now employed in real-world linkage applications. However, existing encoding techniques assume the quasi-identifying attributes used for encoding to be complete. Missing values can lead to incomplete encodings which can result in decreased or increased similarities and therefore to false non-matches or false matches. In this study we empirically evaluate three binary encoding techniques using real voter databases, where pairs of records that correspond to the same voter (with name or address changes) resulted in files of 100,000 and 500,000 records containing from 0% to 50% missing values. ResultsWe encoded between two and four of the attributes first and last name, street, and city into three record-level binary encodings: Cryptographic long-term key (CLK) [Schnell et al. 2009], record-level Bloom filter (RBF) [Durham et al. 2014], and tabulation Min-hashing (TBH) [Smith 2017]. Experiments showed a 10% to 25% drop on average in both precision and recall for all encoding techniques when missing values are increasing. CLK resulted in the highest decrease in precision, while TBH resulted in the highest decrease in recall compared to the other encoding techniques. ConclusionBinary encodings such as Bloom filters are now used in practical applications for linking sensitive databases. Our evaluation shows that such encoding techniques can result in lower linkage quality if there are missing values in quasi-identifying attributes. This highlights the need for novel encoding techniques that can overcome the challenge of missing values.

Download Full-text

Secure Privacy Preserving Record Linkage of Large Databases by Modified Bloom Filter Encodings

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.29 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 2

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Large Scale ◽

Bloom Filter ◽

Privacy Preserving ◽

Error Rates ◽

Bloom Filters ◽

Data Sets ◽

Research Subjects ◽

Practical Applications ◽

Large Databases

ABSTRACTObjectiveIn most European settings, record linkage across different institutions has to be based on personal identifiers such as names, birthday or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates up to 20% per identifier, therefore linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record-linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, very few of them are suitable for practical applications with large database containing millions of records as they are typical for administrative or medical databases. One proven technique for PPRL for large scale applications is PPRL based on Bloom filters.MethodUsing appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 Million records. By the application of suitable blocking strategies, linking can be done in reasonable time.ResultHowever, Bloom filters have been subject of cryptographic attacks. Previous research has shown that the straight application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.

Download Full-text

Optimization of the Mainzelliste software for fast privacy-preserving record linkage

Journal of Translational Medicine ◽

10.1186/s12967-020-02678-1 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Florens Rohde ◽

Martin Franke ◽

Ziad Sehili ◽

Martin Lablans ◽

Erhard Rahm

Keyword(s):

Record Linkage ◽

Comprehensive Evaluation ◽

Personal Data ◽

Privacy Preserving ◽

Use Cases ◽

Locality Sensitive Hashing ◽

Third Party ◽

Bloom Filters ◽

Linkage Quality ◽

Order Of Magnitude

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed a very high linkage quality for both plaintext as well as encoded data even in the presence of errors. The provided blocking methods provide order of magnitude improvements regarding runtime performance thus facilitating the use in research projects with large datasets and many participants.

Download Full-text

Linking Sensitive Data – Applications, Techniques, and Challenges

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1475 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Peter Christen ◽

Thilina Ranbaduge ◽

Rainer Schnell

Keyword(s):

Social Science Research ◽

Bloom Filter ◽

Science Research ◽

Privacy Preserving ◽

Point Of View ◽

Technical Knowledge ◽

Sensitive Data ◽

Diverse Range ◽

Practical Applications ◽

Linkage Quality

IntroductionThe linking of sensitive databases containing personal identifying information across organisations is an increasingly important task in application domains ranging from health and social science research to national censuses. Various techniques have been developed to facilitate the linking of sensitive databases while at the same time preserving the privacy of individuals represented in these databases. Objectives and approachWe present several case studies where the privacy-preserving linking of sensitive databases is crucial, and then discuss the advantages and limitations of existing algorithms and techniques to link sensitive databases. We discuss privacy techniques such as Bloom filter encoding, hashing, and secure multi-party computation, from the point of view of a linkage practitioner. We highlight those aspects that are important when selecting or implementing a privacy-preserving linkage technique within practical applications. ResultsConceptually, linkage techniques can be evaluated across three main dimensions linkage quality, scalability to linking large or multiple databases, and the privacy protection provided by a technique. From a practical perspective, however, several other dimensions are crucial, including the availability of software or ease of implementation, technical knowledge available in an organisation, and the suitability of techniques for a given linkage scenario. Our analysis of a diverse range of linkage techniques has shown that currently no technique provides an adequate solution along all conceptual as well as all practical dimensions. ConclusionsMore research is required to develop novel techniques that facilitate the privacy-preserving linkage of large sensitive databases across organisations; including new encoding methods and cryptanalysis attacks (where until now most attacks have neglected the attack vectors that likely occur in practice), and novel evaluation measures to assess the privacy provided by linkage techniques. We encourage practitioners to be aware of the identified limitations – as well as the opportunities – of existing privacy-preserving linkage techniques and carefully assess the technical and organisational requirements of such techniques within their institution.

Download Full-text

Privacy-Preserving Multipoint Traffic Flow Estimation for Road Networks

Security and Communication Networks ◽

10.1155/2021/6619770 ◽

2021 ◽

Vol 2021 ◽

pp. 1-16

Author(s):

Elmahdi Bentafat ◽

M. Mazhar Rathore ◽

Spiridon Bakiras

Keyword(s):

Traffic Flow ◽

Intelligent Transportation Systems ◽

Estimation Error ◽

Bloom Filter ◽

Privacy Preserving ◽

Transportation Systems ◽

Bloom Filters ◽

Accurate Estimation ◽

Traffic Flows ◽

The Individual

Intelligent transportation systems necessitate a fine-grained and accurate estimation of vehicular traffic flows across critical paths of the underlying road network. However, such statistics should be collected in a manner that does not disclose the trajectories of individual users. To this end, we introduce a privacy-preserving protocol that leverages roadside units (RSUs) to communicate with the passing vehicles, in order to construct encrypted Bloom filters stemming from random vehicle IDs that are chosen secretly by the individual vehicles. Each Bloom filter represents the set of vehicle IDs that contacted the RSU but may also be used to estimate the traffic flow between any number of RSUs. More precisely, we designed a probabilistic model that approximates multipoint traffic flows by estimating the number of common vehicles among a given set of RSUs. Through extensive simulation experiments, we demonstrate that our protocol is very accurate—with a minor deviation from the real traffic flow—and show that it reduces the estimation error by a large factor, when compared to the current state-of-the-art approaches. Furthermore, our implementation of the underlying cryptographic primitives illustrates the feasibility, practicality, and scalability of the system.

Download Full-text

High quality linkage using Multibit Trees for privacy-preserving blocking

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.149 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Adrian Brown ◽

Christian Borgs ◽

Sean Randall ◽

Rainer Schnell

Keyword(s):

Real World ◽

New Technology ◽

Bloom Filter ◽

Privacy Preserving ◽

Sensitive Data ◽

High Quality ◽

Linkage Quality ◽

Real World Datasets ◽

Similarity Thresholds ◽

F Measure

ABSTRACT ObjectivesAs privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research looks at the use of a Q-gram Fingerprinting blocking technique, with Multibit Trees, and applies this method to real-world datasets. ApproachData comprised ten years of hospital and mortality records from several Australian states, totalling over 25 million records. Each record contained a linkage key, as defined by the jurisdiction, which was used to assess quality (i.e. used as a ‘gold standard’). Different parameter sets were defined for the linkage tests with a privacy-preserved file created for each parameter set. The files contained jurisdictional linkage key and a Cryptographic Long-term Key (the CLK is a Bloom filter comprising all fields in the parameter set). Each file was run through an implementation of the Q-gram Fingerprinting blocking algorithm as a deduplication technique, using different similarity thresholds. The quality metrics of precision, recall and f-measure were calculated. ResultsResultant quality varied for each parameter set. Adding suburb and postcode reduced the linkage quality. The best parameter set returned an F-measure of 0.951. In general, precision was high in all settings, but recall fell as more fields were added to the CLK. We will report details for all parameter settings and their corresponding results. ConclusionThe Q-gram Fingerprinting blocking technique shows promise for maintaining high quality linkage in reasonable time. Determining which fields to include in the CLK for the linkage of specific datasets is important to maximise linkage quality, as well as selecting optimal similarity thresholds. Developing new technology is important for progressing the implementation of PPRL in real-world settings.

Download Full-text

Proof of Concept for a Privacy Preserving National Mortality Register

International Journal for Population Data Science ◽

10.23889/ijpds.v3i4.732 ◽

2018 ◽

Vol 3 (4) ◽

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

High Precision ◽

Bloom Filter ◽

Privacy Preserving ◽

University Hospital ◽

Bloom Filters ◽

National Database ◽

Federal State ◽

Mortality Data ◽

National Mortality ◽

The Given

IntroductionNational mortality registers are essential for medical research. Therefore, most nations operate such registers. Due to the administrative structure and data protection legislation, there is no such registry in Germany. We demonstrate that a national mortality registry is technically feasible under the given constraints with privacy preserving record linkage (PPRL). Objectives and ApproachGetting the legal permission to operate a national mortality registry for research will be easier if the linkage can be done without revealing personal identifiers by using PPRL. To estimate precision and recall of different encodings, we used two settings: (1) matching a local mortality registry (n = 14,003) with mortality data of a university hospital (n = 2,466); (2) matching 1 million simulated records from a national database of names with a corrupted subset. This corresponds to a match of all deceased persons with the deceased persons in the largest federal state (n = 205,000). ResultsLinkage results for clear-text identifiers show very high recall and precision. Bloom-Filter based encryptions yield comparable results. Neither precision nor recall declines more than 2%. Phonetic codes yield high precision but low recall. Some variants of Bloom Filter-based encodings yield better results than probabilistic linkage on clear-text identifiers. This is mainly due to the rarely mentioned detail of using different passwords for different identifiers in the same Bloom Filter. Therefore, implementation details of Bloom Filters are more important than commonly thought. Overall, we recommend the use of salted Bloom Filter-based methods with different passwords for different identifiers to increase security and to prevent all known attacks on identifier encryptions. Conclusion/ImplicationsAlthough most PPRL techniques would yield acceptable results in the given setting of a national register, salted Bloom filter encodings are more secure against attacks while still showing high precision and recall. Therefore, we consider a national mortality register using only encrypted identifiers of deceased persons as feasible.

Download Full-text

Cryptanalysis of Basic Bloom Filters Used for Privacy Preserving Record Linkage

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v6i2.640 ◽

2014 ◽

Vol 6 (2) ◽

Cited By ~ 23

Author(s):

Frank Niedermeyer ◽

Simone Steinmetzer ◽

Martin Kroll ◽

Rainer Schnell

Keyword(s):

Statistical Analysis ◽

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Small Subset ◽

Data Set ◽

Successful Attack ◽

Global Data

Bloom filter encoded identifiers are increasingly used for privacy preserving record linkage applications, because they allow for errors in encrypted identifiers. However, little research on the security of Bloom filters has been published so far. In this paper, we formalize a successful attack on Bloom filters composed of bigrams. It has previously been assumed in the literature that an attacker knows the global data set from which a sample is drawn. In contrast, we suppose that an attacker does not know this global data set. Instead, we assume the adversary knows a publicly available list of the most frequent attributes. The attack is based on subtle filtering and elementary statistical analysis of encrypted bigrams. The attack described in this paper can be used for the deciphering of a whole database instead of only a small subset of the most frequent names, as in previous research. We illustrate our proposed method with an attack on a database of encrypted surnames. Finally, we describe modifications of the Bloom filters for preventing similar attacks.

Download Full-text

Encoding Diagnostic Codes for Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1461 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Privacy Preserving ◽

New Method ◽

Bloom Filters ◽

Mortality Data ◽

Sensitive Information ◽

Diagnostic Code ◽

Relational Similarity ◽

Linkage Quality ◽

The Impact

IntroductionDiagnostic codes, such as the ICD-10, may be considered as sensitive information. If such codes have to be encoded using current methods for data linkage, all hierarchical information given by the code positions will be lost. We present a technique (HPBFs) for preserving the hierarchical information of the codes while protecting privacy. The new method modifies a widely used Privacy-preserving Record Linkage (PPRL) technique based on Bloom filters for the use with hierarchical codes. Objectives and ApproachAssessing the similarities of hierarchical codes requires considering the code positions of two codes in a given diagnostic hierarchy. The hierarchical similarities of the original diagnostic code pairs should correspond closely to the similarity of the encoded pairs of the same code. Furthermore, to assess the hierarchy-preserving properties of an encoding, the impact on similarity measures from differing code positions at all levels of the code hierarchy can be evaluated. A full match of codes should yield a higher similarity than partial matches. Finally, the new method is tested against ad-hoc solutions as an addition to a standard PPRL setup. This is done using real-world mortality data with a known link status of two databases. ResultsIn all applications for encoded ICD codes where either categorical discrimination, relational similarity or linkage quality in a PPRL setting is required, HPBFs outperform other known methods. Lower mean differences and smaller confidence intervals between clear-text codes and encrypted code pairs were observed, indicating better preservation of hierarchical similarities. Finally, using these techniques allows for much better hierarchical discrimination for partial matches. ConclusionThe new technique yields better linkage results than all other known methods to encrypt hierarchical codes. In all tests, comparing categorical discrimination, relational similarity and PPRL linkage quality, HPBFs outperformed methods currently used.

Download Full-text