Implementing privacy-preserving record linkage: welcome to the real world

ABSTRACT Objectives While record linkage has become a strategic research priority within Australia and internationally, legal and administrative issues prevent data linkage in some situations due to privacy concerns. Even current best practices in record linkage carry some privacy risk as they require the release of personally identifying information to trusted third parties. Application of record linkage systems that do not require the release of personal information can overcome legal and privacy issues surrounding data integration. Current conceptual and experimental privacy-preserving record linkage (PPRL) models show promise in addressing data integration challenges but do not yet address all of the requirements for real-world operations. This paper aims to identify and address some of the challenges of operationalising PPRL frameworks. Approach Traditional linkage processes involve comparing personally identifying information (name, address, date of birth) on pairs of records to determine whether the records belong to the same person. Designing appropriate linkage strategies is an important part of the process. These are typically based on the analysis of data attributes (metadata) such as data completeness, consistency, constancy and field discriminating power. Under a PPRL model, however, these factors cannot be discerned from the encrypted data, so an alternative approach is required. This paper explores methods for data profiling, blocking, weight/threshold estimation and error detection within a PPRL framework. Results Probabilistic record linkage typically involves the estimation of weights and thresholds to optimise the linkage and ensure highly accurate results. The paper outlines the metadata requirements and automated methods necessary to collect data without compromising privacy. We present work undertaken to develop parameter estimation methods which can help optimise a linkage strategy without the release of personally identifiable information. These are required in all parts of the privacy preserving record linkage process (pre-processing, standardising activities, linkage, grouping and extracting). Conclusions PPRL techniques that operate on encrypted data have the potential for large-scale record linkage, performing both accurately and efficiently under experimental conditions. Our research has advanced the current state of PPRL with a framework for secure record linkage that can be implemented to improve and expand linkage service delivery while protecting an individual’s privacy. However, more research is required to supplement this technique with additional elements to ensure the end-to-end method is practical and can be incorporated into real-world models.
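
The weight and threshold estimation mentioned in this abstract can be illustrated with a minimal sketch of Fellegi-Sunter-style match weights, a common basis for probabilistic linkage. The abstract does not say which estimator the authors use, so everything here is an assumption for illustration: the field names, the m- and u-probabilities, and the threshold value are all hypothetical. In a PPRL setting the agreement flags would come from comparing encoded fields rather than clear-text values.

```python
import math

# Hypothetical per-field probabilities (assumed, not taken from the paper):
# m = P(field agrees | the two records belong to the same person)
# u = P(field agrees | the two records belong to different people)
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "date_of_birth": (0.98, 0.003),
    "postcode": (0.85, 0.05),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Log2 likelihood ratio contributed by a single field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_score(agreement: dict[str, bool]) -> float:
    """Sum the field weights for one candidate record pair."""
    return sum(
        field_weight(*FIELD_PROBS[name], agrees)
        for name, agrees in agreement.items()
    )


THRESHOLD = 10.0  # assumed cut-off; pairs scoring above it are treated as links

full = pair_score({"surname": True, "date_of_birth": True, "postcode": True})
partial = pair_score({"surname": True, "date_of_birth": False, "postcode": True})
print(full > THRESHOLD, partial > THRESHOLD)
```

With these assumed values, a pair agreeing on all three fields scores above the threshold, while a pair disagreeing on date of birth falls below it.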


Privacy-Preserving Record Linkage: An international collaboration between Canada, Australia and Wales

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.101 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 2

Author(s):

Conrad Pow ◽

Karey Iron ◽

James Boyd ◽

Adrian Brown ◽

Simon Thompson ◽

...

Keyword(s):

International Collaboration ◽

Record Linkage ◽

Large Scale ◽

Phase 1 ◽

Performance Testing ◽

Privacy Preserving ◽

Data Repository ◽

Hospital Inpatient ◽

Experimental Conditions ◽

Match Rate

ABSTRACT Objectives Linkage of “big data” can provide the answers to a variety of health questions that benefit the delivery of patient care, impact of policies, system planning and evaluation. In some jurisdictions, legal and operational barriers may prevent data linkage for research and system evaluation. Collaboration between international research institutions in Canada, Australia and Wales was formed at the Farr Institute International Conference in 2015. This partnership will test privacy-preserving record linkage (PPRL) techniques for linkage accuracy on real datasets held in a Canadian data repository. Approach Bloom filter PPRL techniques have been incorporated into a prototype linkage system. Evaluations of probabilistic linkage using the Bloom filter method have shown potential for large-scale record linkage, performing both accurately and efficiently under experimental conditions. The prototype will be used to evaluate the Bloom filter PPRL techniques in 3 phases. Phase 1: 3 tests using simulated data relating to 20 million individuals will be matched to a sub-cohort of 1 million individuals. Phase 2: 100,000 people from hospital inpatient records will be matched to 18 million people in a health system registration file. These tests will inform whether the method can achieve high levels of privacy protection without negatively impacting performance and linkage quality. Performance indicators include match rate and processing efficiency based on record volumes. Results Linkage quality will be assessed by the number of true matches and non-matches identified as links and non-links. This method will be evaluated using synthetic and real-world datasets, where the true match status is known. Initial performance testing linked a file of 3,000 records to 30,000 with a 100% match result. Subsequent test phases as above will continue to be evaluated and these results will be presented. Conclusion Completion of the phased tests will confirm the ability to link datasets while preserving privacy. This international collaboration will expand the utility of this prototype linkage system and expand the global knowledge bank focusing on PPRL methods in general. It will also inform how to adapt to local requirements by providing a solution to many common legal and administrative challenges.


Overcoming the Impasse 2: Assessing the Quality of Recent Australian Applications of a Privacy-Preserving Record Linkage Method (PPRL-BLOOM)

International Journal for Population Data Science ◽

10.23889/ijpds.v5i5.1489 ◽

2020 ◽

Vol 5 (5) ◽

Author(s):

Sean Randall ◽

Adrian Brown ◽

Anna Ferrante ◽

James Boyd ◽

Katie Irvine ◽

...

Keyword(s):

Real World ◽

Child Protection ◽

Record Linkage ◽

Data Linkage ◽

Privacy Preserving ◽

Third Party ◽

Regulatory Constraints ◽

Linkage Quality ◽

Personally Identifying Information

Introduction While the quantity and type of datasets used by data linkage projects is growing, there remain some datasets that are ‘not available’ or ‘hard to access’ by researchers and linkers, either due to legal/regulatory constraints restricting the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and Approach We present and describe several real-world applications of privacy preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data was linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. Results The privacy preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / Implications Privacy preserving linkage is a useful and innovative methodology that is currently being used in real world projects. The results of these evaluations suggest it can be an appropriate linkage tool when legal or other constraints block release of personally identifying information to third party linkage units.


Implementing a hash-based privacy-preserving record linkage tool in the OneFlorida clinical research network

JAMIA Open ◽

10.1093/jamiaopen/ooz050 ◽

2019 ◽

Vol 2 (4) ◽

pp. 562-569 ◽

Cited By ~ 3

Author(s):

Jiang Bian ◽

Alexander Loiacono ◽

Andrei Sura ◽

Tonatiuh Mendoza Viramontes ◽

Gloria Lipori ◽

...

Keyword(s):

Clinical Research ◽

Open Source ◽

Real World ◽

Record Linkage ◽

Gold Standard ◽

Privacy Preserving ◽

Research Network ◽

Clinical Research Network ◽

Real World Setting ◽

Hash Codes

Abstract Objective To implement an open-source tool that performs deterministic privacy-preserving record linkage (RL) in a real-world setting within a large research network. Materials and Methods We learned 2 efficient deterministic linkage rules using publicly available voter registration data. We then validated the 2 rules’ performance with 2 manually curated gold-standard datasets linking electronic health records and claims data from 2 sources. We developed an open-source Python-based tool, OneFL Deduper, that (1) creates seeded hash codes of combinations of patients’ quasi-identifiers using a cryptographic one-way hash function to achieve privacy protection and (2) links and deduplicates patient records using a central broker through matching of hash codes with a high precision and reasonable recall. Results We deployed the OneFL Deduper (https://github.com/ufbmi/onefl-deduper) in OneFlorida, a state-based clinical research network that is part of the national Patient-Centered Clinical Research Network (PCORnet). Using the gold-standard datasets, we achieved a precision of 97.25% to 99.7% and a recall of 75.5%. With the tool, we deduplicated ∼3.5 million (out of ∼15 million) records down to 1.7 million unique patients across 6 health care partners and the Florida Medicaid program. We demonstrated the benefits of RL through examining different disease profiles of the linked cohorts. Conclusions Many factors, including privacy risk considerations, policies and regulations, data availability and quality, and computing resources, can impact how an RL solution is constructed in a real-world setting. Nevertheless, RL is a significant task in improving the data quality in a network so that we can draw reliable scientific discoveries from these massive data resources.
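
The idea of seeded hash codes over combinations of quasi-identifiers can be sketched as follows. This is not the OneFL Deduper code: the choice of HMAC-SHA256, the normalisation step, the field combination and the seed value are all assumptions made for illustration, since the abstract does not specify the hash function or the two linkage rules.

```python
import hashlib
import hmac


def normalise(value: str) -> str:
    """Lower-case and strip all whitespace so trivial variants hash alike."""
    return "".join(value.lower().split())


def seeded_hash(fields: list[str], seed: bytes) -> str:
    """Keyed one-way hash of a combination of quasi-identifiers."""
    message = "|".join(normalise(f) for f in fields).encode("utf-8")
    return hmac.new(seed, message, hashlib.sha256).hexdigest()


# Hypothetical seed shared by the data partners but not by the central broker.
SEED = b"example-shared-seed"


def rule_hash(record: dict[str, str]) -> str:
    """One hypothetical deterministic rule: first name + last name + date of birth."""
    return seeded_hash([record["first"], record["last"], record["dob"]], SEED)


site_a = {"first": "Maria", "last": "Lopez", "dob": "1980-02-14"}
site_b = {"first": " maria", "last": "LOPEZ ", "dob": "1980-02-14"}
site_c = {"first": "Mario", "last": "Lopez", "dob": "1980-02-14"}

# The broker only ever compares hash codes, never the identifiers themselves.
print(rule_hash(site_a) == rule_hash(site_b))
print(rule_hash(site_a) == rule_hash(site_c))
```

Because matching is exact on the hash code, a single-character difference (as in the third record) prevents a link, which is consistent with the high-precision, moderate-recall profile reported for deterministic rules.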


Secure Privacy Preserving Record Linkage of Large Databases by Modified Bloom Filter Encodings

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.29 ◽

2017 ◽

Vol 1 (1) ◽

Cited By ~ 2

Author(s):

Rainer Schnell ◽

Christian Borgs

Keyword(s):

Record Linkage ◽

Large Scale ◽

Bloom Filter ◽

Privacy Preserving ◽

Error Rates ◽

Bloom Filters ◽

Data Sets ◽

Research Subjects ◽

Practical Applications ◽

Large Databases

ABSTRACT Objective In most European settings, record linkage across different institutions has to be based on personal identifiers such as names, birthday or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates up to 20% per identifier, therefore linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record-linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Although many different PPRL techniques have been suggested within the last 10 years, very few of them are suitable for practical applications with large databases containing millions of records, as is typical for administrative or medical databases. One proven technique for PPRL for large scale applications is PPRL based on Bloom filters. Method Using appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 million records. By the application of suitable blocking strategies, linking can be done in reasonable time. Result However, Bloom filters have been the subject of cryptographic attacks. Previous research has shown that the straight application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.
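
For readers unfamiliar with Bloom filter encodings, the following minimal sketch shows the basic, unhardened construction that the abstract refers to as the "straight application": an identifier is split into bigrams, each bigram is mapped to several bit positions by keyed hashing, and two encodings are compared with the Dice coefficient. The filter size, number of hash functions, key and padding character are assumptions. The diffusion-based hardening techniques announced in the abstract are not described there and are not implemented here.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a padded, lower-cased string into its set of 2-character substrings."""
    padded = f"_{value.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, key: bytes, size: int = 1000, num_hashes: int = 20) -> set[int]:
    """Return the positions of the bits set to 1 in the Bloom filter for `value`."""
    positions = set()
    for gram in bigrams(value):
        for i in range(num_hashes):
            digest = hmac.new(key, bytes([i]) + gram.encode("utf-8"), hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two bit-position sets (1.0 means identical filters)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


KEY = b"example-secret-key"  # hypothetical key shared by the data owners

smith = bloom_encode("Smith", KEY)
smyth = bloom_encode("Smyth", KEY)
jones = bloom_encode("Jones", KEY)

print(dice(smith, smyth), dice(smith, jones))
```

A spelling variant such as "Smyth" still shares most bigrams with "Smith" and therefore yields a high similarity, which is why this encoding tolerates the identifier errors mentioned above where exact matching on encrypted values does not.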


Evaluation of advanced techniques for multi-party privacy-preserving record linkage on real-world health databases

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.106 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Thilina Ranbaduge ◽

Dinusha Vatsalan ◽

Sean Randall ◽

Peter Christen

Keyword(s):

Real World ◽

Record Linkage ◽

State Of The Art ◽

Privacy Preserving ◽

World Health ◽

Disclosure Risk ◽

Linkage Quality ◽

Private Matching ◽

Number Of Parties ◽

Matching Techniques

ABSTRACT Objective The linking of multiple (three or more) health databases is challenging because of the increasing sizes of databases, the number of parties among which they are to be linked, and privacy concerns related to the use of personal data such as names, addresses, or dates of birth. This entails a need to develop advanced scalable techniques for linking multiple databases while preserving the privacy of the individuals they contain. In this study we empirically evaluate several state-of-the-art multi-party privacy-preserving record linkage (MP-PPRL) techniques with large real-world health databases from Australia. Approach MP-PPRL is conducted such that no sensitive information is revealed about database records that can be used to infer knowledge about individuals or groups of individuals. Current state-of-the-art methods used in this evaluation use Bloom filters to encode personal identifying information. The empirical evaluation comprises different multi-party private blocking and matching techniques that are evaluated for different numbers of parties. Each database contains more than 700,000 records extracted from ten years of New South Wales (NSW) emergency presentation data. Each technique is evaluated with regard to scalability, quality and privacy. Scalability and quality are measured using the metrics of reduction ratio, pairs completeness, precision, recall, and F-measure. Privacy is measured using disclosure risk metrics that are based on the probability of suspicion, defined as the likelihood that a record in an encoded database matches to one or more record(s) in a publicly available database such as a telephone directory. Both MP-PPRL techniques that utilize a trusted linkage unit and those that do not are evaluated. Results Experimental results showed MP-PPRL methods are practical for linking large-scale real world data. Private blocking techniques achieved significantly higher privacy than standard hashing-based techniques with a maximum disclosure risk of 0.0003 and 1, respectively, at a small cost to linkage quality and efficiency. Similarly, private matching techniques provided a similar acceptable reduction in linkage quality compared to standard non-private matching while providing high privacy protection. Conclusion The adoption of privacy-preserving linkage methods has the ability to significantly reduce privacy risks associated with linking large health databases, and enable the data linkage community to offer operational linkage services not previously possible. The evaluation results show that these state-of-the-art MP-PPRL techniques are scalable in terms of database sizes and number of parties, while providing significantly improved privacy with an associated trade-off in linkage quality compared to standard linkage techniques.
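
The scalability and quality metrics named in this abstract have standard definitions, sketched below for the simpler two-database case (the paper itself evaluates three or more parties, and its exact multi-party formulations are not given in the abstract). The toy record identifiers and pair sets are invented for illustration.

```python
Pair = tuple[int, int]


def blocking_metrics(
    candidates: set[Pair], true_matches: set[Pair], size_a: int, size_b: int
) -> tuple[float, float]:
    """Reduction ratio and pairs completeness of a blocking step."""
    total_pairs = size_a * size_b  # comparisons without any blocking
    reduction_ratio = 1 - len(candidates) / total_pairs
    pairs_completeness = len(candidates & true_matches) / len(true_matches)
    return reduction_ratio, pairs_completeness


def quality_metrics(predicted: set[Pair], true_matches: set[Pair]) -> tuple[float, float, float]:
    """Precision, recall and F-measure of the final link decisions."""
    true_positives = len(predicted & true_matches)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy example: database A has 4 records, database B has 5 records.
true_matches = {(0, 0), (1, 1), (2, 2), (3, 4)}
candidates = {(0, 0), (1, 1), (2, 2), (2, 3), (3, 0)}  # pairs surviving blocking
predicted = {(0, 0), (1, 1), (2, 3)}                   # pairs classified as links

rr, pc = blocking_metrics(candidates, true_matches, size_a=4, size_b=5)
precision, recall, f_measure = quality_metrics(predicted, true_matches)
print(rr, pc, precision, recall, f_measure)
```

Reduction ratio captures how much work blocking saves, while pairs completeness captures how many true matches survive it; the two usually trade off against each other.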


Locational privacy-preserving distance computations with intersecting sets of randomly labeled grid points

International Journal of Health Geographics ◽

10.1186/s12942-021-00268-y ◽

2021 ◽

Vol 20 (1) ◽

Author(s):

Rainer Schnell ◽

Jonas Klingwort ◽

James M. Farrow

Keyword(s):

Programming Languages ◽

Real World ◽

Spatial Data ◽

Large Scale ◽

Privacy Preserving ◽

Real World Data ◽

Data Set ◽

Additional Information ◽

Grid Points ◽

High Level

Abstract Background We introduce and study a recently proposed method for privacy-preserving distance computations which has received little attention in the scientific literature so far. The method, which is based on intersecting sets of randomly labeled grid points and is henceforth denoted as ISGP, allows calculating the approximate distances between masked spatial data. Coordinates are replaced by sets of hash values. The method allows the computation of distances between locations L when the locations at different points in time t are not known simultaneously. The distance between $L_1$ and $L_2$ could be computed even when $L_2$ does not exist at $t_1$ and $L_1$ has been deleted at $t_2$. An example would be patients from a medical data set and locations of later hospitalizations. ISGP is a new tool for privacy-preserving data handling of geo-referenced data sets in general. Furthermore, this technique can be used to include geographical identifiers as additional information for privacy-preserving record-linkage. To show that the technique can be implemented in most high-level programming languages with a few lines of code, a complete implementation within the statistical programming language R is given. The properties of the method are explored using simulations based on large-scale real-world data of hospitals ($n = 850$) and residential locations ($n = 13{,}000$). The method has already been used in a real-world application. Results ISGP yields very accurate results. Our simulation study showed that, with appropriately chosen parameters, 99% accuracy in the approximated distances is achieved. Conclusion We discussed a new method for privacy-preserving distance computations in microdata. The method is highly accurate, fast, has low computational burden, and does not require excessive storage.
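
The following is a loose Python sketch of the general idea described in this abstract, not the R implementation published with the paper. Each location is replaced by the set of hashed labels of all grid points within a fixed radius, and the number of labels two sets share is used to approximate the distance. Several parts are assumptions of this sketch: a keyed hash stands in for the random labelling of grid points, and the distance is recovered by inverting the area formula for the overlap of two equal circles, which may differ from the estimator used by the authors. The radius, grid spacing, key and coordinates are invented.

```python
import hashlib
import hmac
import math


def masked_points(x: float, y: float, radius: float, spacing: float, key: bytes) -> set[str]:
    """Hashed labels of all grid points within `radius` of the location (x, y)."""
    labels = set()
    for i in range(math.floor((x - radius) / spacing), math.ceil((x + radius) / spacing) + 1):
        for j in range(math.floor((y - radius) / spacing), math.ceil((y + radius) / spacing) + 1):
            if (i * spacing - x) ** 2 + (j * spacing - y) ** 2 <= radius ** 2:
                labels.add(hmac.new(key, f"{i},{j}".encode(), hashlib.sha256).hexdigest())
    return labels


def lens_area(d: float, r: float) -> float:
    """Area of the overlap of two circles of radius r whose centres are d apart."""
    if d >= 2 * r:
        return 0.0
    return 2 * r * r * math.acos(d / (2 * r)) - (d / 2) * math.sqrt(4 * r * r - d * d)


def estimate_distance(a: set[str], b: set[str], radius: float, spacing: float) -> float:
    """Invert lens_area by bisection, using the shared-label count as an area estimate."""
    target = len(a & b) * spacing ** 2
    lo, hi = 0.0, 2 * radius
    for _ in range(60):
        mid = (lo + hi) / 2
        if lens_area(mid, radius) > target:
            lo = mid  # overlap still too large, so the true distance is greater
        else:
            hi = mid
    return (lo + hi) / 2


KEY = b"example-grid-key"  # hypothetical secret defining the grid labelling
home = masked_points(3.0, 7.0, radius=500, spacing=10, key=KEY)
hospital = masked_points(303.0, 407.0, radius=500, spacing=10, key=KEY)  # 500 units away

print(estimate_distance(home, hospital, radius=500, spacing=10))
```

Note that the two masked sets can be produced at different times and by different parties, which mirrors the property highlighted in the abstract. In this sketch, locations more than twice the radius apart share no labels, so only a lower bound on their distance can be stated.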


High-Value Token-Blocking: Efficient Blocking Method for Record Linkage

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3450527 ◽

2021 ◽

Vol 16 (2) ◽

pp. 1-17

Author(s):

Kevin O’hare ◽

Anna Jurek-Loughrey ◽

Cassio De Campos

Keyword(s):

Data Integration ◽

Real World ◽

Record Linkage ◽

Big Data Analytics ◽

Heterogeneous Data ◽

Heterogeneous Data Sources ◽

Document Frequency ◽

Scientific Papers ◽

Computational Resources ◽

Different Characteristics

Data integration is an important component of Big Data analytics. One of the key challenges in data integration is record linkage, that is, matching records that represent the same real-world entity. Because of computational costs, methods referred to as blocking are employed as a part of the record linkage pipeline in order to reduce the number of comparisons among records. In the past decade, a range of blocking techniques have been proposed. Real-world applications require approaches that can handle heterogeneous data sources and do not rely on labelled data. We propose high-value token-blocking (HVTB), a simple and efficient approach for blocking that is unsupervised and schema-agnostic, based on a crafted use of Term Frequency-Inverse Document Frequency. We compare HVTB with multiple methods and over a range of datasets, including a novel unstructured dataset composed of titles and abstracts of scientific papers. We thoroughly discuss results in terms of accuracy, use of computational resources, and different characteristics of datasets and records. The simplicity of HVTB yields fast computations and does not harm its accuracy when compared with existing approaches. It is shown to be significantly superior to other methods, suggesting that simpler methods for blocking should be considered before resorting to more sophisticated methods.
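
To give a feel for schema-agnostic token blocking weighted by TF-IDF, here is a deliberately simplified sketch. It is not the published HVTB algorithm, whose details are not given in the abstract: the tokeniser, the rule of keeping the top-k tokens per record, the exclusion of tokens seen in only one record and the toy records are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokens(record: dict) -> list[str]:
    """Schema-agnostic tokenisation: ignore field names, pool all values."""
    text = " ".join(str(value) for value in record.values())
    return re.findall(r"[a-z0-9]+", text.lower())


def high_value_blocks(records: dict[int, dict], top_k: int = 2) -> dict[str, set[int]]:
    """Block each record under its top_k highest TF-IDF tokens."""
    token_lists = {rid: tokens(rec) for rid, rec in records.items()}
    doc_freq = Counter(tok for toks in token_lists.values() for tok in set(toks))
    n = len(records)
    blocks = defaultdict(set)
    for rid, toks in token_lists.items():
        tf = Counter(toks)
        # A token that occurs in only one record can never form a block.
        usable = [t for t in tf if doc_freq[t] > 1]
        ranked = sorted(usable, key=lambda t: (-tf[t] * math.log(n / doc_freq[t]), t))
        for tok in ranked[:top_k]:
            blocks[tok].add(rid)
    return dict(blocks)


def candidate_pairs(blocks: dict[str, set[int]]) -> set[tuple[int, int]]:
    """All record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


records = {
    1: {"name": "Alice Moreau", "city": "Perth"},
    2: {"full_name": "Moreau, Alice", "location": "Perth WA"},
    3: {"name": "Brian Okoye", "city": "Perth"},
    4: {"title": "Okoye Brian", "city": "Cardiff"},
}

print(candidate_pairs(high_value_blocks(records)))
```

In this toy data the common token "perth" is ranked below the rarer name tokens, so it generates no comparisons, whereas plain token blocking on every shared token would also pair records 1 and 2 with record 3.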


Federated Trusted Third Party as an Approach for Privacy Preserving Record Linkage in a Large Network of University Medicines in Pandemic Research

10.21203/rs.3.rs-1053445/v1 ◽

2021 ◽

Author(s):

Christopher Hampf ◽

Martin Bialke ◽

Hauke Hund ◽

Christian Fegeler ◽

Stefan Lang ◽

...

Keyword(s):

Data Integration ◽

Record Linkage ◽

Privacy Preserving ◽

Medical Data ◽

Third Party ◽

Bloom Filters ◽

Large Network ◽

Trusted Third Party ◽

University Medicine ◽

Expansion Stage

Abstract Background The Federal Ministry of Research and Education funded the Network of University Medicine for establishing an infrastructure for pandemic research. This includes the development of a COVID-19 Data Exchange Platform (CODEX) that provides standardised and harmonised data sets for COVID-19 research. Nearly all university hospitals in Germany are part of the project and transmit medical data from the local data integration centres to the CODEX platform. The medical data on a person that has been collected at several sites is to be made available on the CODEX platform in a merged form. To enable this, a federated trusted third party (fTTP) will be established, which will allow the pseudonymised merging of the medical data. The fTTP implements privacy preserving record linkage based on Bloom filters and assigns pseudonyms to enable re-pseudonymisation during data transfer to the CODEX platform. Results The fTTP was implemented conceptually and technically. For this purpose, the processes that are necessary for data delivery were modelled. The resulting communication relationships were identified and corresponding interfaces were specified. These were developed according to the specifications in FHIR and validated with the help of external partners. Existing tools such as the identity management system E-PIX® were further developed accordingly so that sites can generate Bloom filters based on person identifying information. An extension for the comparison of Bloom filters was implemented for the federated trusted third party. The correct implementation was shown in the form of a demonstrator and the connection of two data integration centres. Conclusions This article describes how the fTTP was modelled and implemented. In a first expansion stage, two sites were connected to the fTTP as an example and its functionality was demonstrated. Further expansion stages, which are already planned, have been technically specified and will be implemented in the future in order to also handle cases in which the privacy preserving record linkage achieves ambiguous results. The first expansion stage of the fTTP is available in the University Medicine network and will be connected by all participating sites in the ongoing test phase.
