Privacy-preserving record linkage on large real world datasets

Abstract. Objective: To implement an open-source tool that performs deterministic privacy-preserving record linkage (RL) in a real-world setting within a large research network. Materials and Methods: We learned 2 efficient deterministic linkage rules using publicly available voter registration data. We then validated the 2 rules’ performance with 2 manually curated gold-standard datasets linking electronic health records and claims data from 2 sources. We developed an open-source Python-based tool, OneFL Deduper, that (1) creates seeded hash codes of combinations of patients’ quasi-identifiers using a cryptographic one-way hash function to achieve privacy protection and (2) links and deduplicates patient records using a central broker through matching of hash codes with high precision and reasonable recall. Results: We deployed OneFL Deduper (https://github.com/ufbmi/onefl-deduper) in OneFlorida, a state-based clinical research network that is part of the national Patient-Centered Clinical Research Network (PCORnet). Using the gold-standard datasets, we achieved a precision of 97.25% to 99.7% and a recall of 75.5%. With the tool, we deduplicated ~3.5 million (out of ~15 million) records down to 1.7 million unique patients across 6 health care partners and the Florida Medicaid program. We demonstrated the benefits of RL by examining different disease profiles of the linked cohorts. Conclusions: Many factors, including privacy risk considerations, policies and regulations, data availability and quality, and computing resources, can affect how an RL solution is constructed in a real-world setting. Nevertheless, RL is a significant task in improving data quality in a network so that we can draw reliable scientific discoveries from these massive data resources.
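
The hash-and-match idea described in this abstract can be illustrated with a minimal Python sketch. It is not the OneFL Deduper code: the two rules, the field names, the seed value, and the choice of HMAC-SHA-256 as the way to combine a seed with a one-way hash are all assumptions made for illustration.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical shared seed; it would be known to the data partners only,
# never to the broker that performs the matching.
SEED = b"example-shared-seed"

# Two illustrative linkage rules (assumed, not the rules learned in the paper).
RULES = {
    "rule_1": ("first_name", "last_name", "dob", "sex"),
    "rule_2": ("last_name", "first_name", "dob", "zip"),
}


def normalize(value):
    """Lowercase and keep only letters and digits so trivial variants agree."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def hash_codes(record):
    """Partner side: one keyed SHA-256 digest per rule for a patient record."""
    codes = {}
    for rule, fields in RULES.items():
        message = "|".join(normalize(record[f]) for f in fields)
        codes[rule] = hmac.new(SEED, message.encode("utf-8"), hashlib.sha256).hexdigest()
    return codes


def link(hashed_records):
    """Broker side: group record ids that share a digest under any rule."""
    buckets = defaultdict(set)
    for rec_id, codes in hashed_records:
        for rule, digest in codes.items():
            buckets[(rule, digest)].add(rec_id)
    groups = {tuple(sorted(ids)) for ids in buckets.values() if len(ids) > 1}
    return sorted(groups)
```

In this sketch the broker only ever sees record ids and digests. Because the match is exact, any difference that survives normalization (a changed ZIP code under `rule_2`, for example) breaks the link for that rule, which is one reason a deterministic approach of this kind tends to trade recall for precision.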

Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.106 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Thilina Ranbaduge ◽

Dinusha Vatsalan ◽

Sean Randall ◽

Peter Christen

Keyword(s):

Real World ◽

Record Linkage ◽

State Of The Art ◽

Privacy Preserving ◽

World Health ◽

Disclosure Risk ◽

Linkage Quality ◽

Private Matching ◽

Number Of Parties ◽

Matching Techniques

ABSTRACT. Objective: The linking of multiple (three or more) health databases is challenging because of the increasing sizes of databases, the number of parties among which they are to be linked, and privacy concerns related to the use of personal data such as names, addresses, or dates of birth. This entails a need to develop advanced scalable techniques for linking multiple databases while preserving the privacy of the individuals they contain. In this study we empirically evaluate several state-of-the-art multi-party privacy-preserving record linkage (MP-PPRL) techniques with large real-world health databases from Australia. Approach: MP-PPRL is conducted such that no sensitive information is revealed about database records that can be used to infer knowledge about individuals or groups of individuals. Current state-of-the-art methods used in this evaluation use Bloom filters to encode personal identifying information. The empirical evaluation comprises different multi-party private blocking and matching techniques that are evaluated for different numbers of parties. Each database contains more than 700,000 records extracted from ten years of New South Wales (NSW) emergency presentation data. Each technique is evaluated with regard to scalability, quality and privacy. Scalability and quality are measured using the metrics of reduction ratio, pairs completeness, precision, recall, and F-measure. Privacy is measured using disclosure risk metrics that are based on the probability of suspicion, defined as the likelihood that a record in an encoded database matches one or more records in a publicly available database such as a telephone directory. Both MP-PPRL techniques that utilize a trusted linkage unit and those that do not are evaluated. Results: Experimental results showed MP-PPRL methods are practical for linking large-scale real-world data. Private blocking techniques achieved significantly higher privacy than standard hashing-based techniques, with a maximum disclosure risk of 0.0003 and 1, respectively, at a small cost to linkage quality and efficiency. Similarly, private matching techniques incurred an acceptable reduction in linkage quality compared to standard non-private matching while providing high privacy protection. Conclusion: The adoption of privacy-preserving linkage methods has the ability to significantly reduce privacy risks associated with linking large health databases, and enable the data linkage community to offer operational linkage services not previously possible. The evaluation results show that these state-of-the-art MP-PPRL techniques are scalable in terms of database sizes and number of parties, while providing significantly improved privacy with an associated trade-off in linkage quality compared to standard linkage techniques.
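
The scalability and quality metrics named in this abstract have standard definitions, sketched below in Python for the simpler two-database case (the study itself links three or more parties). The function names and the representation of record pairs as tuples in sets are assumptions for illustration, not the authors' evaluation code.

```python
def blocking_metrics(candidate_pairs, true_matches, n_a, n_b):
    """Reduction ratio and pairs completeness for a blocking step.

    candidate_pairs and true_matches are sets of (id_a, id_b) tuples;
    n_a and n_b are the sizes of the two databases.
    """
    total_pairs = n_a * n_b
    # Share of the full comparison space that blocking removed.
    reduction_ratio = 1.0 - len(candidate_pairs) / total_pairs
    # Share of true matches that survived blocking.
    pairs_completeness = len(candidate_pairs & true_matches) / len(true_matches)
    return reduction_ratio, pairs_completeness


def matching_metrics(predicted_matches, true_matches):
    """Precision, recall and F-measure for the final match decisions."""
    true_positives = len(predicted_matches & true_matches)
    precision = true_positives / len(predicted_matches) if predicted_matches else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure
```

Reduction ratio and pairs completeness describe the blocking step (how much work was saved and how many true matches were kept), while precision, recall and F-measure describe the final classification.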

Implementing privacy-preserving record linkage: welcome to the real world

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.153 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

James Boyd ◽

Anna Ferrante ◽

Adrian Brown ◽

Sean Randall ◽

James Semmens

Keyword(s):

Data Integration ◽

Real World ◽

Error Detection ◽

Record Linkage ◽

Large Scale ◽

Privacy Preserving ◽

Estimation Methods ◽

Experimental Conditions ◽

Encrypted Data ◽

Personally Identifying Information

ABSTRACT. Objectives: While record linkage has become a strategic research priority within Australia and internationally, legal and administrative issues prevent data linkage in some situations due to privacy concerns. Even current best practices in record linkage carry some privacy risk as they require the release of personally identifying information to trusted third parties. Application of record linkage systems that do not require the release of personal information can overcome legal and privacy issues surrounding data integration. Current conceptual and experimental privacy-preserving record linkage (PPRL) models show promise in addressing data integration challenges but do not yet address all of the requirements for real-world operations. This paper aims to identify and address some of the challenges of operationalising PPRL frameworks. Approach: Traditional linkage processes involve comparing personally identifying information (name, address, date of birth) on pairs of records to determine whether the records belong to the same person. Designing appropriate linkage strategies is an important part of the process. These are typically based on the analysis of data attributes (metadata) such as data completeness, consistency, constancy and field discriminating power. Under a PPRL model, however, these factors cannot be discerned from the encrypted data, so an alternative approach is required. This paper explores methods for data profiling, blocking, weight/threshold estimation and error detection within a PPRL framework. Results: Probabilistic record linkage typically involves the estimation of weights and thresholds to optimise the linkage and ensure highly accurate results. The paper outlines the metadata requirements and automated methods necessary to collect data without compromising privacy. We present work undertaken to develop parameter estimation methods which can help optimise a linkage strategy without the release of personally identifiable information. These are required in all parts of the privacy-preserving record linkage process (pre-processing, standardising activities, linkage, grouping and extracting). Conclusions: PPRL techniques that operate on encrypted data have the potential for large-scale record linkage, performing both accurately and efficiently under experimental conditions. Our research has advanced the current state of PPRL with a framework for secure record linkage that can be implemented to improve and expand linkage service delivery while protecting an individual’s privacy. However, more research is required to supplement this technique with additional elements to ensure the end-to-end method is practical and can be incorporated into real-world models.
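
To make "weights and thresholds" concrete, the following Python sketch uses the standard Fellegi-Sunter formulation of probabilistic linkage, which the abstract does not spell out. The m- and u-probabilities, field names, and threshold values are hypothetical; the paper's own estimation methods are not reproduced here.

```python
import math

# Hypothetical per-field probabilities:
#   m = P(field agrees | records belong to the same person)
#   u = P(field agrees | records belong to different people)
M_U = {
    "surname": (0.95, 0.01),
    "dob": (0.97, 0.003),
    "postcode": (0.85, 0.05),
}


def field_weight(field, agrees):
    """Log-likelihood-ratio weight for agreement or disagreement on a field."""
    m, u = M_U[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def score(agreement):
    """Sum the field weights for a record pair's agreement pattern."""
    return sum(field_weight(field, agrees) for field, agrees in agreement.items())


def classify(total, lower=0.0, upper=10.0):
    """Apply two (assumed) thresholds to a total weight."""
    if total >= upper:
        return "match"
    if total <= lower:
        return "non-match"
    return "possible match"
```

The sketch consumes only agree/disagree flags, so it is indifferent to whether the field comparison happened on clear-text or on encoded values. What a PPRL setting changes is how m, u and the thresholds can be estimated without seeing the underlying identifiers, which is the problem the abstract describes.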

An Automatic Blocking Keys Selection For Efficient Record Linkage

International Journal of Organizational and Collective Intelligence ◽

10.4018/ijoci.2021010104 ◽

2021 ◽

Vol 11 (1) ◽

pp. 53-70

Author(s):

Hamid Naceur Benkhlaed ◽

Djamal Berrabah ◽

Nassima Dif ◽

Faouzi Boufares

Keyword(s):

Feature Selection ◽

Data Quality ◽

Optimization Algorithm ◽

Real World ◽

Record Linkage ◽

Domain Expert ◽

Bald Eagles ◽

Unsupervised Approach ◽

Selection For ◽

Real World Datasets

One of the important processes in the data quality field is record linkage (RL). RL (also known as entity resolution) is the process of detecting duplicates that refer to the same real-world entity in one or more datasets. The most critical step during the RL process is blocking, which reduces the quadratic complexity of the process by dividing the data into a set of blocks. In this way, matching is done only between the records in the same block. However, selecting the best blocking keys to divide the data is a hard task, and in most cases it is done by a domain expert. In this paper, a novel unsupervised approach for automatic blocking key selection is proposed. This approach is based on the recently proposed meta-heuristic bald eagles search (BES) optimization algorithm, where the problem is treated as a feature selection case. The results obtained from experiments on real-world datasets showed the efficiency of the proposal: BES for feature selection outperformed existing approaches in the literature and returned the best blocking keys.
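
The effect of blocking on the number of comparisons can be seen in a short Python sketch. The blocking key used here (first two letters of the surname plus birth year) is hand-picked and hypothetical; it stands in for the key that the paper's BES-based selection would choose automatically, which is not implemented here.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first two letters of the surname plus birth year."""
    return record["surname"][:2].lower() + record["birth_year"]


def candidate_pairs(records, key_fn):
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[key_fn(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

Without blocking, n records require n(n-1)/2 comparisons; with blocking, only pairs inside the same block are compared. A poorly chosen key either leaves blocks too large (little saving) or separates true duplicates into different blocks (lost matches), which is why key selection matters.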

A Comparison of Statistical Linkage Keys with Bloom Filter-based Encryptions for Privacy-preserving Record Linkage using Real-world Mammography Data

Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies ◽

10.5220/0006140302760283 ◽

2017 ◽

Cited By ~ 2

Author(s):

Rainer Schnell ◽

Anke Richter ◽

Christian Borgs

Keyword(s):

Real World ◽

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving

High quality linkage using Multibit Trees for privacy-preserving blocking

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.149 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Adrian Brown ◽

Christian Borgs ◽

Sean Randall ◽

Rainer Schnell

Keyword(s):

Real World ◽

New Technology ◽

Bloom Filter ◽

Privacy Preserving ◽

Sensitive Data ◽

High Quality ◽

Linkage Quality ◽

Real World Datasets ◽

Similarity Thresholds ◽

F Measure

ABSTRACT. Objectives: As privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research looks at the use of a Q-gram Fingerprinting blocking technique, with Multibit Trees, and applies this method to real-world datasets. Approach: Data comprised ten years of hospital and mortality records from several Australian states, totalling over 25 million records. Each record contained a linkage key, as defined by the jurisdiction, which was used to assess quality (i.e. used as a ‘gold standard’). Different parameter sets were defined for the linkage tests with a privacy-preserved file created for each parameter set. The files contained the jurisdictional linkage key and a Cryptographic Long-term Key (the CLK is a Bloom filter comprising all fields in the parameter set). Each file was run through an implementation of the Q-gram Fingerprinting blocking algorithm as a deduplication technique, using different similarity thresholds. The quality metrics of precision, recall and F-measure were calculated. Results: Resultant quality varied for each parameter set. Adding suburb and postcode reduced the linkage quality. The best parameter set returned an F-measure of 0.951. In general, precision was high in all settings, but recall fell as more fields were added to the CLK. We will report details for all parameter settings and their corresponding results. Conclusion: The Q-gram Fingerprinting blocking technique shows promise for maintaining high quality linkage in reasonable time. Determining which fields to include in the CLK for the linkage of specific datasets is important to maximise linkage quality, as well as selecting optimal similarity thresholds. Developing new technology is important for progressing the implementation of PPRL in real-world settings.
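
A CLK-style encoding and a similarity comparison between two encodings can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the filter length, number of hash functions, secret key, bigram padding, and the use of HMAC-SHA-256 with a counter byte to derive bit positions are all assumptions, and the Multibit Tree blocking step itself is not shown.

```python
import hashlib
import hmac

FILTER_BITS = 1000   # assumed filter length
NUM_HASHES = 20      # assumed number of bit positions per q-gram


def qgrams(value, q=2):
    """Split a padded, lowercased string into its set of q-grams."""
    padded = "_" * (q - 1) + value.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def clk(fields, secret=b"example-key"):
    """Map the q-grams of all fields into one Bloom filter.

    The filter is represented as the set of bit positions that are set.
    """
    bits = set()
    for value in fields:
        for gram in qgrams(value):
            for k in range(NUM_HASHES):
                message = bytes([k]) + gram.encode("utf-8")
                digest = hmac.new(secret, message, hashlib.sha256).digest()
                bits.add(int.from_bytes(digest[:8], "big") % FILTER_BITS)
    return bits


def dice(a, b):
    """Dice coefficient of two filters given as sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Two records are treated as the same entity when the similarity of their filters reaches a chosen threshold. Since every field in the parameter set feeds the same filter, each extra field adds another place where a discrepancy can pull the similarity below the threshold, which is consistent with the abstract's observation that recall fell as fields were added.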

Overcoming the Impasse 2: Assessing the Quality of Recent Australian Applications of a Privacy-Preserving Record Linkage Method (PPRL-BLOOM)

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1489 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Sean Randall ◽

Adrian Brown ◽

Anna Ferrante ◽

James Boyd ◽

Katie Irvine ◽

...

Keyword(s):

Real World ◽

Child Protection ◽

Record Linkage ◽

Data Linkage ◽

Privacy Preserving ◽

Third Party ◽

Regulatory Constraints ◽

Linkage Quality ◽

Personally Identifying Information

Introduction: While the quantity and type of datasets used by data linkage projects are growing, there remain some datasets that are ‘not available’ or ‘hard to access’ by researchers and linkers, either due to legal/regulatory constraints restricting the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and Approach: We present and describe several real-world applications of privacy-preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data was linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. Results: The privacy-preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / Implications: Privacy-preserving linkage is a useful and innovative methodology that is currently being used in real-world projects. The results of these evaluations suggest it can be an appropriate linkage tool when legal or other constraints block release of personally identifying information to third-party linkage units.

An Evaluation Framework for Privacy-Preserving Record Linkage

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v6i1.636 ◽

2014 ◽

Vol 6 (1) ◽

Cited By ~ 20

Author(s):

Dinusha Vatsalan ◽

Peter Christen ◽

Christine M. O'Keefe ◽

Vassilios S. Verykios

Keyword(s):

Real World ◽

Record Linkage ◽

Comparative Evaluation ◽

Privacy Preserving ◽

Evaluation Framework ◽

Sensitive Information ◽

The Past ◽

Large Databases ◽

Linkage Quality ◽

The Face

Privacy-preserving record linkage (PPRL) addresses the problem of identifying matching records from different databases that correspond to the same real-world entities using quasi-identifying attributes (in the absence of unique entity identifiers), while preserving the privacy of these entities. Privacy is preserved by not revealing any information that could be used to infer the actual values of the records that are not reconciled to the same entity (non-matches), or any confidential or sensitive information (that is not agreed upon by the data custodians) about the records that were reconciled to the same entity (matches), during or after the linkage process. The PPRL process involves three main challenges: scalability to large databases, high linkage quality in the presence of data quality errors, and sufficient privacy guarantees. While many solutions have been developed for the PPRL problem over the past two decades, an evaluation and comparison framework of PPRL solutions with standard numerical measures defined for all three properties (scalability, linkage quality, and privacy) has so far not been presented in the literature. We propose a general framework with normalized measures to practically evaluate and compare PPRL solutions in the face of linkage attack methods that are based on an external global dataset. We conducted experiments with several existing PPRL solutions on real-world databases using our proposed evaluation framework, and the results show that our framework provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.

Towards Auditable and Intelligent Privacy-Preserving Record Linkage

10.5753/sbbd_estendido.2021.18170 ◽

2021 ◽

Author(s):

Thiago Nóbrega ◽

Carlos Eduardo S. Pires ◽

Dimas Cassimiro Nascimento

Keyword(s):

Real World ◽

Record Linkage ◽

Personal Information ◽

Privacy Preserving ◽

Data Sources ◽

Machine Learning Techniques ◽

Sensitive Data ◽

General Data Protection Regulation ◽

Assistance Programs ◽

Linkage Quality

Privacy-Preserving Record Linkage (PPRL) intends to integrate private/sensitive data from several data sources held by different parties. It aims to identify records (e.g., persons or objects) representing the same real-world entity over private data sources held by different custodians. Due to recent laws and regulations (e.g., General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process needs to deal with both efficacy (linkage quality) and privacy problems. For instance, the PPRL process needs to be executed over data sources (e.g., a database containing personal information of governmental income distribution and assistance programs) with an accurate linkage of the entities while, at the same time, protecting the privacy of the information. Thus, this work intends to simplify the PPRL process for real-world applications (such as medical, epidemiologic, and populational studies) by reducing the legal and bureaucratic effort needed to access and process the data, making these applications more straightforward for companies and governments to execute. In this context, this work presents two major contributions to PPRL: i) it improves linkage quality and simplifies the process by employing Machine Learning techniques to decide whether or not two records represent the same entity; and ii) it enables the auditability of the computations performed during PPRL.

Comparing Record Linkage methods for real-world perinatal and neonatal data without unique identifiers

International Journal for Population Data Science ◽

10.23889/ijpds.v4i3.1244 ◽

2019 ◽

Vol 4 (3) ◽

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Clinical Practice ◽

Real World ◽

Record Linkage ◽

Incomplete Data ◽

Privacy Preserving ◽

Administrative Databases ◽

National Database ◽

Linkage Methods ◽

Standard Patient ◽

The Impact

Background: Data on newborns is regularly linked for epidemiological research. However, hospital data is often incomplete. We report on a linkage of two population-covering administrative health databases containing neonatal and perinatal data without unique personal identifiers and with incomplete information in standard patient identifiers. Goal: To study the effects of a policy-induced change from linking a national database without standard patient identifiers to a privacy-preserving Record Linkage method, we compare the linkage system in use to clear-text and privacy-preserving Record Linkage techniques. We expected large proportions of missing identifiers since they are not needed for clinical practice. Therefore, we expected missing links caused by missing identifiers. To study the impact of these missing identifiers on successful links, we compared several linkage methods. Furthermore, we study the variations of linkage success between hospitals. Methods: Perinatal and neonatal data from population-covering real-world administrative databases was linked using several variants of state-of-the-art methods, including Privacy-preserving Record Linkage (PPRL) techniques such as multiple match keys and Bloom filter methods. Results: We report on the variation of linkage results between the hospitals and give possible explanations for the differences. The resulting linkage success is reported for each method. The impact of incomplete data on linkage success for each method is documented. Finally, we report on the relative performance of the modified techniques compared to standard linkage procedures used in practice. Conclusion: Implementing a record linkage system based on identifiers not required for clinical practice caused a large number of missing identifiers. Since this information is essential for successful clear-text and private linkage methods, emphasizing the need to document patient identifiers is of central importance for implementing a privacy-preserving Record Linkage system, especially in cases where auxiliary information (such as stable addresses, date of birth or health insurance numbers) is missing.
