Towards Auditable and Intelligent Privacy-Preserving Record Linkage

Privacy-Preserving Record Linkage (PPRL) intends to integrate private/sensitive data from several data sources held by different parties. It aims to identify records (e.g., persons or objects) representing the same real-world entity over private data sources held by different custodians. Due to recent laws and regulations (e.g., General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process needs to deal with efficacy (linkage quality), and privacy problems. For instance, the PPRL process needs to be executed over data sources (e.g., a database containing personal information of governmental income distribution and assistance programs), with an accurate linkage of the entities, and, at the same time, protect the privacy of the information. Thus, this work intends to simplify the PPRL process by facilitating real-world applications (such as medical, epidemiologic, and populational studies) to reduce legal and bureaucratic efforts to access and process the data, making these applications' execution more straightforward for companies and governments. In this context, this work presents two major contributions to PPRL: i) an improvement to the linkage quality and simplify the process by employing Machine Learning techniques to decide whether two records represent the same entity, or not; and ii) we enable the auditability the computations performed during PPRL.

Download Full-text

Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.106 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Thilina Ranbaduge ◽

Dinusha Vatsalan ◽

Sean Randall ◽

Peter Christen

Keyword(s):

Real World ◽

Record Linkage ◽

State Of The Art ◽

Privacy Preserving ◽

World Health ◽

Disclosure Risk ◽

Linkage Quality ◽

Private Matching ◽

Number Of Parties ◽

Matching Techniques

ABSTRACT ObjectiveThe linking of multiple (three or more) health databases is challenging because of the increasing sizes of databases, the number of parties among which they are to be linked, and privacy concerns related to the use of personal data such as names, addresses, or dates of birth. This entails a need to develop advanced scalable techniques for linking multiple databases while preserving the privacy of the individuals they contain. In this study we empirically evaluate several state-of-the-art multi-party privacy-preserving record linkage (MP-PPRL) techniques with large real-world health databases from Australia. ApproachMP-PPRL is conducted such that no sensitive information is revealed about database records that can be used to infer knowledge about individuals or groups of individuals. Current state-of-the-art methods used in this evaluation use Bloom filters to encode personal identifying information. The empirical evaluation comprises of different multi-party private blocking and matching techniques that are evaluated for different numbers of parties. Each database contains more than 700,000 records extracted from ten years of New South Wales (NSW) emergency presentation data. Each technique is evaluated with regard to scalability, quality and privacy. Scalability and quality are measured using the metrics of reduction ratio, pairs completeness, precision, recall, and F-measure. Privacy is measured using disclosure risk metrics that are based on the probability of suspicion, defined as the likelihood that a record in an encoded database matches to one or more record(s) in a publicly available database such as a telephone directory. MP-PPRL techniques that either utilize a trusted linkage unit, and those that do not, are evaluated. ResultsExperimental results showed MP-PPRL methods are practical for linking large-scale real world data. Private blocking techniques achieved significantly higher privacy than standard hashing-based techniques with a maximum disclosure risk of 0.0003 and 1, respectively, at a small cost to linkage quality and efficiency. Similarly, private matching techniques provided a similar acceptable reduction in linkage quality compared to standard non-private matching while providing high privacy protection. ConclusionThe adoption of privacy-preserving linkage methods has the ability to significantly reduce privacy risks associated with linking large health databases, and enable the data linkage community to offer operational linkage services not previously possible. The evaluation results show that these state-of-the-art MP-PPRL techniques are scalable in terms of database sizes and number of parties, while providing significantly improved privacy with an associated trade-off in linkage quality compared to standard linkage techniques.

Download Full-text

Real world performance of privacy preserving record linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v3i4.990 ◽

2018 ◽

Vol 3 (4) ◽

Cited By ~ 2

Author(s):

Katie Irvine ◽

Michael Smith ◽

Reinier De Vos ◽

Adrian Brown ◽

Anna Ferrante ◽

...

Keyword(s):

Real World ◽

Record Linkage ◽

Gold Standard ◽

Secondary Care ◽

Personal Information ◽

Quality Metrics ◽

Bloom Filters ◽

Mortality Data ◽

Probabilistic Linkage ◽

Linkage Quality

IntroductionPrivacy preserving record linkage (PPRL) using encoded or hashed data has potential to enable large-scale record linkage of previously inaccessible data. With limited real-world evaluation and implementation of PPRL at scale it is challenging for linkage practitioners to judiciously balance data protection with the accuracy and usability of linked datasets. Objectives and ApproachWe evaluated the performance of PPRL techniques using Bloom filters for linkage of data across primary and secondary care settings. This technique limits the need to disclose personal information for linkage activities. Primary care data included 272,202 records from 16 general practices in NSW. This was linked to 42.8 million records from a 7 year series of emergency presentations, hospitalisations and death registrations. For the purpose of evaluation, personal information was encoded within the data linkage centre. The quality of PPRL linkage was assessed against the true match status based on a gold standard probabilistic linkage using full personal identifiers. ResultsCompared to the gold standard probabilistic linkage using full personal identifiers, the PPRL techniques produced quality metrics of precision, recall and F measure in excess of 0.90. When configured to leverage pre-existing links between emergency department, hospital and mortality data, quality metrics around 0.98-0.99 were achieved. Lower rates of linkage quality were associated with missing demographic information and some residual variation in linkage quality across practices was observed. Conclusion/ImplicationsPPRL using Bloom filters is a promising technique for achieving high quality linkage across primary and secondary care in Australia. Further evaluation will assess scalability and quality in Australia but international collaborations are encouraged to more rapidly develop the evidence base and tactical approaches to support real world implementations.

Download Full-text

High quality linkage using Multibit Trees for privacy-preserving blocking

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.149 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Adrian Brown ◽

Christian Borgs ◽

Sean Randall ◽

Rainer Schnell

Keyword(s):

Real World ◽

New Technology ◽

Bloom Filter ◽

Privacy Preserving ◽

Sensitive Data ◽

High Quality ◽

Linkage Quality ◽

Real World Datasets ◽

Similarity Thresholds ◽

F Measure

ABSTRACT ObjectivesAs privacy-preserving record linkage (PPRL) emerges as a method for linking sensitive data, efficient blocking techniques that help maintain high levels of linkage quality are required. This research looks at the use of a Q-gram Fingerprinting blocking technique, with Multibit Trees, and applies this method to real-world datasets. ApproachData comprised ten years of hospital and mortality records from several Australian states, totalling over 25 million records. Each record contained a linkage key, as defined by the jurisdiction, which was used to assess quality (i.e. used as a ‘gold standard’). Different parameter sets were defined for the linkage tests with a privacy-preserved file created for each parameter set. The files contained jurisdictional linkage key and a Cryptographic Long-term Key (the CLK is a Bloom filter comprising all fields in the parameter set). Each file was run through an implementation of the Q-gram Fingerprinting blocking algorithm as a deduplication technique, using different similarity thresholds. The quality metrics of precision, recall and f-measure were calculated. ResultsResultant quality varied for each parameter set. Adding suburb and postcode reduced the linkage quality. The best parameter set returned an F-measure of 0.951. In general, precision was high in all settings, but recall fell as more fields were added to the CLK. We will report details for all parameter settings and their corresponding results. ConclusionThe Q-gram Fingerprinting blocking technique shows promise for maintaining high quality linkage in reasonable time. Determining which fields to include in the CLK for the linkage of specific datasets is important to maximise linkage quality, as well as selecting optimal similarity thresholds. Developing new technology is important for progressing the implementation of PPRL in real-world settings.

Download Full-text

Privacy Attack on Multiple Dynamic Match-key based Privacy-Preserving Record Linkage

International Journal for Population Data Science ◽

10.23889/ijpds.v5i1.1345 ◽

2020 ◽

Vol 5 (1) ◽

Author(s):

Anushka Vidanage ◽

Thilina Ranbaduge ◽

Peter Christen ◽

Sean Randall

Keyword(s):

Record Linkage ◽

Personal Information ◽

Privacy Preserving ◽

Statistical Correlation ◽

Sensitive Information ◽

Frequency Distributions ◽

Plain Text ◽

Large Databases ◽

Linkage Quality ◽

Privacy Attack

Introduction Over the last decade, the demand for linking records about people across databases has increased in various domains. Privacy challenges associated with linking sensitive information led to the development of privacy-preserving record linkage techniques. The multiple dynamic match-key encoding approach recently proposed by Randall et al. (IJPDS, 2019) is such a technique aimed at providing sufficient privacy for linkage applications while obtaining high linkage quality. However, the use of this encoding in large databases can reveal frequency information that can allow the re-identification of encoded values. Objectives We propose a frequency-based attack to evaluate the privacy guarantees of multiple dynamic match-key encoding. We then present two improvements to this match-key encoding approach to prevent such a privacy attack. Methods The proposed attack analyses the frequency distributions of individual match-keys in order to identify the attributes used for each match-key, where we assume the adversary has access to a plain-text database with similar characteristics as the encoded database. We employ a set of statistical correlation tests to compare the frequency distributions of match-key values between the encoded and plain-text databases. Once the attribute combinations used for match-keys are discovered, we then re-identify encoded sensitive values by utilising a frequency alignment method. Next, we propose two modifications to the match-key encoding; one to alter the original frequency distributions and another to make the frequency distributions uniform. Both will help to prevent frequency-based attacks. Results We evaluate our privacy attack using two large real-world databases. The results show that in certain situations the attack can successfully re-identify a set of sensitive values encoded using the multiple dynamic match-key encoding approach. On the databases used in our experiments, the attack is able to re-identify plain-text values with a precision and recall of both up to 98%. Furthermore, we show that our proposed improvements are able to make this attack harder to perform with only a small reduction in linkage quality. Conclusions Our proposed privacy attack demonstrates the weaknesses of multiple match-key encoding that should be taken into consideration when linking databases that contain sensitive personal information. Our proposed modifications ensure that the multiple dynamic match-key encoding approach can be used securely while retaining high linkage quality.

Download Full-text

Overcoming the Impasse 2: Assessing the Quality of Recent Australian Applications of a Privacy-Preserving Record Linkage Method (PPRL-BLOOM)

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1489 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Sean Randall ◽

Adrian Brown ◽

Anna Ferrante ◽

James Boyd ◽

Katie Irvine ◽

...

Keyword(s):

Real World ◽

Child Protection ◽

Record Linkage ◽

Data Linkage ◽

Privacy Preserving ◽

Third Party ◽

Regulatory Constraints ◽

Linkage Quality ◽

Personally Identifying Information

IntroductionWhile the quantity and type of datasets used by data linkage projects is growing, there remain some datasets that are ‘not available’ or ‘hard to access’ by researchers and linkers, either due to legal/regulatory constraints restricting the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and ApproachWe present and describe several real-world applications of privacy preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data was linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. ResultsThe privacy preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / ImplicationsPrivacy preserving linkage is a useful and innovative methodology that is currently being used in real world projects. The results of these evaluation suggest it can be an appropriate linkage tool when legal or other constraints block release of personally identifying information to third party linkage units.

Download Full-text

An Evaluation Framework for Privacy-Preserving Record Linkage

Journal of Privacy and Confidentiality ◽

10.29012/jpc.v6i1.636 ◽

2014 ◽

Vol 6 (1) ◽

Cited By ~ 20

Author(s):

Dinusha Vatsalan ◽

Peter Christen ◽

Christine M. O'Keefe ◽

Vassilios S. Verykios

Keyword(s):

Real World ◽

Record Linkage ◽

Comparative Evaluation ◽

Privacy Preserving ◽

Evaluation Framework ◽

Sensitive Information ◽

The Past ◽

Large Databases ◽

Linkage Quality ◽

The Face

Privacy-preserving record linkage (PPRL) addresses the problem of identifying matching records from different databases that correspond to the same real-world entities using quasi-identifying attributes (in the absence of unique entity identifiers), while preserving privacy of these entities. Privacy is being preserved by not revealing any information that could be used to infer the actual values about the records that are not reconciled to the same entity (non-matches), and any confidential or sensitive information (that is not agreed upon by the data custodians) about the records that were reconciled to the same entity (matches) during or after the linkage process. The PPRL process often involves three main challenges, which are scalability to large databases, high linkage quality in the presence of data quality errors, and sufficient privacy guarantees. While many solutions have been developed for the PPRL problem over the past two decades, an evaluation and comparison framework of PPRL solutions with standard numerical measures defined for all three properties (scalability, linkage quality, and privacy) of PPRL has so far not been presented in the literature. We propose a general framework with normalized measures to practically evaluate and compare PPRL solutions in the face of linkage attack methods that are based on an external global dataset. We conducted experiments of several existing PPRL solutions on real-world databases using our proposed evaluation framework, and the results show that our framework provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.

Download Full-text

Implementing a hash-based privacy-preserving record linkage tool in the OneFlorida clinical research network

JAMIA Open ◽

10.1093/jamiaopen/ooz050 ◽

2019 ◽

Vol 2 (4) ◽

pp. 562-569 ◽

Cited By ~ 3

Author(s):

Jiang Bian ◽

Alexander Loiacono ◽

Andrei Sura ◽

Tonatiuh Mendoza Viramontes ◽

Gloria Lipori ◽

...

Keyword(s):

Clinical Research ◽

Open Source ◽

Real World ◽

Record Linkage ◽

Gold Standard ◽

Privacy Preserving ◽

Research Network ◽

Clinical Research Network ◽

Real World Setting ◽

Hash Codes

Abstract Objective To implement an open-source tool that performs deterministic privacy-preserving record linkage (RL) in a real-world setting within a large research network. Materials and Methods We learned 2 efficient deterministic linkage rules using publicly available voter registration data. We then validated the 2 rules’ performance with 2 manually curated gold-standard datasets linking electronic health records and claims data from 2 sources. We developed an open-source Python-based tool—OneFL Deduper—that (1) creates seeded hash codes of combinations of patients’ quasi-identifiers using a cryptographic one-way hash function to achieve privacy protection and (2) links and deduplicates patient records using a central broker through matching of hash codes with a high precision and reasonable recall. Results We deployed the OneFl Deduper (https://github.com/ufbmi/onefl-deduper) in the OneFlorida, a state-based clinical research network as part of the national Patient-Centered Clinical Research Network (PCORnet). Using the gold-standard datasets, we achieved a precision of 97.25∼99.7% and a recall of 75.5%. With the tool, we deduplicated ∼3.5 million (out of ∼15 million) records down to 1.7 million unique patients across 6 health care partners and the Florida Medicaid program. We demonstrated the benefits of RL through examining different disease profiles of the linked cohorts. Conclusions Many factors including privacy risk considerations, policies and regulations, data availability and quality, and computing resources, can impact how a RL solution is constructed in a real-world setting. Nevertheless, RL is a significant task in improving the data quality in a network so that we can draw reliable scientific discoveries from these massive data resources.

Download Full-text

Implementing privacy-preserving record linkage: welcome to the real world

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.153 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

James Boyd ◽

Anna Ferrante ◽

Adrian Brown ◽

Sean Randall ◽

James Semmens

Keyword(s):

Data Integration ◽

Real World ◽

Error Detection ◽

Record Linkage ◽

Large Scale ◽

Privacy Preserving ◽

Estimation Methods ◽

Experimental Conditions ◽

Encrypted Data ◽

Personally Identifying Information

ABSTRACT ObjectivesWhile record linkage has become a strategic research priority within Australia and internationally, legal and administrative issues prevent data linkage in some situations due to privacy concerns. Even current best practices in record linkage carry some privacy risk as they require the release of personally identifying information to trusted third parties. Application of record linkage systems that do not require the release of personal information can overcome legal and privacy issues surrounding data integration. Current conceptual and experimental privacy-preserving record linkage (PPRL) models show promise in addressing data integration challenges but do not yet address all of the requirements for real-world operations. This paper aims to identify and address some of the challenges of operationalising PPRL frameworks. ApproachTraditional linkage processes involve comparing personally identifying information (name, address, date of birth) on pairs of records to determine whether the records belong to the same person. Designing appropriate linkage strategies is an important part of the process. These are typically based on the analysis of data attributes (metadata) such as data completeness, consistency, constancy and field discriminating power. Under a PPRL model, however, these factors cannot be discerned from the encrypted data, so an alternative approach is required. This paper explores methods for data profiling, blocking, weight/threshold estimation and error detection within a PPRL framework. ResultsProbabilistic record linkage typically involves the estimation of weights and thresholds to optimise the linkage and ensure highly accurate results. The paper outlines the metadata requirements and automated methods necessary to collect data without compromising privacy. We present work undertaken to develop parameter estimation methods which can help optimise a linkage strategy without the release of personally identifiable information. These are required in all parts of the privacy preserving record linkage process (pre-processing, standardising activities, linkage, grouping and extracting). ConclusionsPPRL techniques that operate on encrypted data have the potential for large-scale record linkage, performing both accurately and efficiently under experimental conditions. Our research has advanced the current state of PPRL with a framework for secure record linkage that can be implemented to improve and expand linkage service delivery while protecting an individual’s privacy. However, more research is required to supplement this technique with additional elements to ensure the end-to-end method is practical and can be incorporated into real-world models.

Download Full-text

Optimization of the Mainzelliste software for fast privacy-preserving record linkage

Journal of Translational Medicine ◽

10.1186/s12967-020-02678-1 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Florens Rohde ◽

Martin Franke ◽

Ziad Sehili ◽

Martin Lablans ◽

Erhard Rahm

Keyword(s):

Record Linkage ◽

Comprehensive Evaluation ◽

Personal Data ◽

Privacy Preserving ◽

Use Cases ◽

Locality Sensitive Hashing ◽

Third Party ◽

Bloom Filters ◽

Linkage Quality ◽

Order Of Magnitude

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed a very high linkage quality for both plaintext as well as encoded data even in the presence of errors. The provided blocking methods provide order of magnitude improvements regarding runtime performance thus facilitating the use in research projects with large datasets and many participants.

Download Full-text

A blinded evaluation of privacy preserving record linkage with Bloom filters

BMC Medical Research Methodology ◽

10.1186/s12874-022-01510-2 ◽

2022 ◽

Vol 22 (1) ◽

Author(s):

Sean Randall ◽

Helen Wichmann ◽

Adrian Brown ◽

James Boyd ◽

Tom Eitelhuber ◽

...

Keyword(s):

Record Linkage ◽

Bloom Filter ◽

Privacy Preserving ◽

Bloom Filters ◽

Filter Method ◽

Morbidity Data ◽

Linkage Quality ◽

Hospital Morbidity ◽

Western Australian ◽

Direct Investigation

Abstract Background Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However real-world evaluations are required to confirm their suitability in practice. Methods An extract of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters, and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.

Download Full-text