Securing Bloom Filters for Privacy-preserving Record Linkage

Author(s):  
Thilina Ranbaduge ◽  
Rainer Schnell
Author(s):  
Rainer Schnell ◽  
Christian Borgs

ABSTRACTObjectiveIn most European settings, record linkage across different institutions has to be based on personal identifiers such as names, birthday or place of birth. To protect the privacy of research subjects, the identifiers have to be encrypted. In practice, these identifiers show error rates up to 20% per identifier, therefore linking on encrypted identifiers usually implies the loss of large subsets of the databases. In many applications, this loss of cases is related to variables of interest for the subject matter of the study. Therefore, this kind of record-linkage will generate biased estimates. These problems gave rise to techniques of Privacy Preserving Record Linkage (PPRL). Many different PPRL techniques have been suggested within the last 10 years, very few of them are suitable for practical applications with large database containing millions of records as they are typical for administrative or medical databases. One proven technique for PPRL for large scale applications is PPRL based on Bloom filters.MethodUsing appropriate parameter settings, Bloom filter approaches show linkage results comparable to linkage based on unencrypted identifiers. Furthermore, this approach has been used in real-world settings with data sets containing up to 100 Million records. By the application of suitable blocking strategies, linking can be done in reasonable time.ResultHowever, Bloom filters have been subject of cryptographic attacks. Previous research has shown that the straight application of Bloom filters has a nonzero re-identification risk. We will present new results on recently developed techniques to defy all known attacks on PPRL Bloom filters. These computationally simple algorithms modify the identifiers by different cryptographic diffusion techniques. The presentation will demonstrate these new algorithms and show their performance concerning precision, recall and re-identification risk on large databases.


2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Florens Rohde ◽  
Martin Franke ◽  
Ziad Sehili ◽  
Martin Lablans ◽  
Erhard Rahm

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed a very high linkage quality for both plaintext as well as encoded data even in the presence of errors. The provided blocking methods provide order of magnitude improvements regarding runtime performance thus facilitating the use in research projects with large datasets and many participants.


2021 ◽  
Author(s):  
Christopher Hampf ◽  
Martin Bialke ◽  
Hauke Hund ◽  
Christian Fegeler ◽  
Stefan Lang ◽  
...  

Abstract BackgroundThe Federal Ministry of Research and Education funded the Network of University Medicine for establishing an infrastructure for pandemic research. This includes the development of a COVID-19 Data Exchange Platform (CODEX) that provides standardised and harmonised data sets for COVID-19 research. Nearly all university hospitals in Germany are part of the project and transmit medical data from the local data integration centres to the CODEX platform. The medical data on a person that has been collected at several sites is to be made available on the CODEX platform in a merged form. To enable this, a federated trusted third party (fTTP) will be established, which will allow the pseudonymised merging of the medical data. The fTTP implements privacy preserving record linkage based on Bloom filters and assigns pseudonyms to enable re-pseudonymisation during data transfer to the CODEX platform.ResultsThe fTTP was implemented conceptually and technically. For this purpose, the processes that are necessary for data delivery were modelled. The resulting communication relationships were identified and corresponding interfaces were specified. These were developed according to the specifications in FHIR and validated with the help of external partners. Existing tools such as the identity management system E-PIX® were further developed accordingly so that sites can generate Bloom filters based on person identifying information. An extension for the comparison of Bloom filters was implemented for the federated trust third party. The correct implementation was shown in the form of a demonstrator and the connection of two data integration centres.ConclusionsThis article describes how the fTTP was modelled and implemented. In a first expansion stage, the fTTP was exemplarily connected through two sites and its functionality was demonstrated. Further expansion stages, which are already planned, have been technically specified and will be implemented in the future in order to also handle cases in which the privacy preserving record linkage achieves ambiguous results. The first expansion stage of the fTTP is available in the University Medicine network and will be connected by all participating sites in the ongoing test phase.


2022 ◽  
Vol 22 (1) ◽  
Author(s):  
Sean Randall ◽  
Helen Wichmann ◽  
Adrian Brown ◽  
James Boyd ◽  
Tom Eitelhuber ◽  
...  

Abstract Background Privacy preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However real-world evaluations are required to confirm their suitability in practice. Methods An extract of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters, and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, where there was no access to un-encoded data or a ‘truth set’. Results The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy preserving and clear-text linkage. Conclusion The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.


Author(s):  
James Boyd ◽  
Yuen Ai Lee ◽  
Adrian Brown ◽  
Sean Randall ◽  
Anna Ferrante

BackgroundGeneral practice is a rich source of health data for research. It is an important resource which can be used to improve patient management, reduce costs and improve patient outcomes. Traditionally, the challenge has been around access to general practice data which remains hard to ‘join up’. This abstract describes technology developed to support aspirations of the MedicineInsight program to provide linked de-identified general practice data that can be used to derive insights to enable better patient outcomes. Main AimThe aim of this project was to use real-world data to identify technical, logistical and analytical requirements throughout the linkage process. Logistical aims covered the negotiation, approval and data acquisition processes, as well as data linkage and data delivery aspects performed by technical and data service stakeholders. Methods/Approach Given the sensitivity of the information involved, the project employed a privacy preserving record linkage methodology. This method uses encrypted personal identifying information (Bloom filters) in a probability-based linkage framework to help mitigate risk while maximising linkage quality. Existing MedicineInsight systems were extended to automatically generate encoded linkage data at each general practice. Pilot linkages were then used to validate the capability/capacity of CDL infrastructure to create secure extensible linked general practice datasets. ResultsThe project has successfully developed interoperable technology to create a transparent data catalogue which is linkable to other datasets. This technology has been embedded with MedicineInsight systems and results of the pilot linkages are being evaluated. The project will make recommendations to enable consistent delivery of linkage services across care settings. ConclusionOutcomes from the project will improve delivery of record linkage services to the health and broader research community. Using linked data from across the care continuum, researchers will be able to evaluate the effectiveness of service delivery and provide evidence for policy and programme development.


Author(s):  
Rainer Schnell ◽  
Tobias Bachteler ◽  
Jörg Reiher

2014 ◽  
Author(s):  
Frank Niedermeyer ◽  
Simone Steinmetzer ◽  
Martin Kroll ◽  
Rainer Schnell

Sign in / Sign up

Export Citation Format

Share Document