scholarly journals Scalable Secure Privacy-Preserving Record Linkage (PPRL) Methods Using Cloud-based Infrastructure

Author(s):  
Toan Ong ◽  
Ibrahim Lazrig ◽  
Indrajit Ray ◽  
Indrakshi Ray ◽  
Michael Kahn

IntroductionBloom Filters (BFs) are a scalable solution for probabilistic privacy-preserving record linkage but BFs can be compromised. Yao’s garbled circuits (GCs) can perform secure multi-party computation to compute the similarity of two BFs without a trusted third party. The major drawback of using BFs and GCs together is poor efficiency. Objectives and ApproachWe evaluated the feasibility of BFs+GCs using high capacity compute engines and implementing a novel parallel processing framework in Google Cloud Compute Engines (GCCE). In the Yao’s two-party secure computation protocol, one party serves as the generator and the other party serves as the evaluator. To link data in parallel, records from both parties are divided into chunks. Linkage between every two chunks in the same block is processed by a thread. The number of threads for linkage depends on available computing resources. We tested the parallelized process in various scenarios with variations in hardware and software configurations. ResultsTwo synthetic datasets with 10K records were linked using BFs+GCs on 12 different software and hardware configurations which varied by: number of CPU cores (4 to 32), memory size (15GB – 28.8GB), number of threads (6-41), and chunk size (50-200 records). The minimum configuration (4 cores; 15GB memory) took 8,062.4s to complete whereas the maximum configuration (32 cores; 28.8GB memory) took 1,454.1s. Increasing the number of threads or changing the chunk size without providing more CPU cores and memory did not improve the efficiency. Efficiency is improved on average by 39.81% when the number of cores and memory on the both sides are doubled. The CPU utilization is maximized (near 100% on both sides) when the computing power of the generator is double the evaluator. Conclusion/ImplicationsThe PPRL runtime of BFs+GCs was greatly improved using parallel processing in a cloud-based infrastructure. A cluster of GCCEs could be leveraged to reduce the runtime of data linkage operations even further. Scalable cloud-based infrastructures can overcome the trade-off between security and efficiency, allowing computationally complex methods to be implemented.

Author(s):  
Sebastian Stammler ◽  
Tobias Kussel ◽  
Phillipp Schoppmann ◽  
Florian Stampe ◽  
Galina Tremper ◽  
...  

Abstract Motivation Record Linkage has versatile applications in real-world data analysis contexts, where several data sets need to be linked on the record level in the absence of any exact identifier connecting related records. An example are medical databases of patients, spread across institutions, that have to be linked on personally identifiable entries like name, date of birth or ZIP code. At the same time, privacy laws may prohibit the exchange of this personally identifiable information (PII) across institutional boundaries, ruling out the outsourcing of the record linkage task to a trusted third party. We propose to employ privacy-preserving record linkage (PPRL) techniques that prevent, to various degrees, the leakage of PII while still allowing for the linkage of related records. Results We develop a framework for fault-tolerant PPRL using secure multi-party computation with the medical record keeping software Mainzelliste as the data source. Our solution does not rely on any trusted third party and all PII is guaranteed to not leak under common cryptographic security assumptions. Benchmarks show the feasibility of our approach in realistic networking settings: linkage of a patient record against a database of 10.000 records can be done in 48s over a heavily delayed (100ms) network connection, or 3.9s with a low-latency connection. Availability and implementation The source code of the sMPC node is freely available on Github at https://github.com/medicalinformatics/SecureEpilinker subject to the AGPLv3 license. The source code of the modified Mainzelliste is available at https://github.com/medicalinformatics/MainzellisteSEL.


2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Florens Rohde ◽  
Martin Franke ◽  
Ziad Sehili ◽  
Martin Lablans ◽  
Erhard Rahm

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed a very high linkage quality for both plaintext as well as encoded data even in the presence of errors. The provided blocking methods provide order of magnitude improvements regarding runtime performance thus facilitating the use in research projects with large datasets and many participants.


2021 ◽  
Author(s):  
Christopher Hampf ◽  
Martin Bialke ◽  
Hauke Hund ◽  
Christian Fegeler ◽  
Stefan Lang ◽  
...  

Abstract BackgroundThe Federal Ministry of Research and Education funded the Network of University Medicine for establishing an infrastructure for pandemic research. This includes the development of a COVID-19 Data Exchange Platform (CODEX) that provides standardised and harmonised data sets for COVID-19 research. Nearly all university hospitals in Germany are part of the project and transmit medical data from the local data integration centres to the CODEX platform. The medical data on a person that has been collected at several sites is to be made available on the CODEX platform in a merged form. To enable this, a federated trusted third party (fTTP) will be established, which will allow the pseudonymised merging of the medical data. The fTTP implements privacy preserving record linkage based on Bloom filters and assigns pseudonyms to enable re-pseudonymisation during data transfer to the CODEX platform.ResultsThe fTTP was implemented conceptually and technically. For this purpose, the processes that are necessary for data delivery were modelled. The resulting communication relationships were identified and corresponding interfaces were specified. These were developed according to the specifications in FHIR and validated with the help of external partners. Existing tools such as the identity management system E-PIX® were further developed accordingly so that sites can generate Bloom filters based on person identifying information. An extension for the comparison of Bloom filters was implemented for the federated trust third party. The correct implementation was shown in the form of a demonstrator and the connection of two data integration centres.ConclusionsThis article describes how the fTTP was modelled and implemented. In a first expansion stage, the fTTP was exemplarily connected through two sites and its functionality was demonstrated. Further expansion stages, which are already planned, have been technically specified and will be implemented in the future in order to also handle cases in which the privacy preserving record linkage achieves ambiguous results. The first expansion stage of the fTTP is available in the University Medicine network and will be connected by all participating sites in the ongoing test phase.


2017 ◽  
Vol 2017 (4) ◽  
pp. 345-364 ◽  
Author(s):  
Adrià Gascón ◽  
Phillipp Schoppmann ◽  
Borja Balle ◽  
Mariana Raykova ◽  
Jack Doerner ◽  
...  

Abstract We propose privacy-preserving protocols for computing linear regression models, in the setting where the training dataset is vertically distributed among several parties. Our main contribution is a hybrid multi-party computation protocol that combines Yao’s garbled circuits with tailored protocols for computing inner products. Like many machine learning tasks, building a linear regression model involves solving a system of linear equations. We conduct a comprehensive evaluation and comparison of different techniques for securely performing this task, including a new Conjugate Gradient Descent (CGD) algorithm. This algorithm is suitable for secure computation because it uses an efficient fixed-point representation of real numbers while maintaining accuracy and convergence rates comparable to what can be obtained with a classical solution using floating point numbers. Our technique improves on Nikolaenko et al.’s method for privacy-preserving ridge regression (S&P 2013), and can be used as a building block in other analyses. We implement a complete system and demonstrate that our approach is highly scalable, solving data analysis problems with one million records and one hundred features in less than one hour of total running time.


Author(s):  
Sean Randall ◽  
Adrian Brown ◽  
Anna Ferrante ◽  
James Boyd ◽  
Katie Irvine ◽  
...  

IntroductionWhile the quantity and type of datasets used by data linkage projects is growing, there remain some datasets that are ‘not available’ or ‘hard to access’ by researchers and linkers, either due to legal/regulatory constraints restricting the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and ApproachWe present and describe several real-world applications of privacy preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data was linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. ResultsThe privacy preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / ImplicationsPrivacy preserving linkage is a useful and innovative methodology that is currently being used in real world projects. The results of these evaluation suggest it can be an appropriate linkage tool when legal or other constraints block release of personally identifying information to third party linkage units.


Author(s):  
Tom Gee ◽  
Brendan Behan ◽  
Shannon Lefaivre ◽  
Mahmoud Azimaee ◽  
Moyez Dharsee ◽  
...  

IntroductionThe Ontario Brain Institute has developed Brain-CODE, an informatics platform, to support the acquisition, storage, management and analysis of multi-modal data. The standardized research data within Brain-CODE spans several brain disorders, allowing for integrative analyses, while also providing the opportunity to leverage existing clinical administrative data holdings through external linkages. Objectives and ApproachWithin Ontario, the majority of individuals who access the healthcare system have a unique identifier, the Ontario Health Insurance Plan (OHIP) number. The OHIP number can facilitate linkages with administrative data holdings, such as those at the Institute for Clinical Evaluative Sciences (ICES). Given that OBI is not permitted under Ontario’s privacy legislation to hold OHIP numbers, identifiers for consented participants are encrypted using a public key mechanism upon entry into Brain-CODE, where the private key is inaccessible. To facilitate linkages involving OHIP numbers between Brain-CODE and ICES, Brain-CODE Link software was co-developed by members of the Indoc Consortium. ResultsBrain-CODE Link allows a deterministic linkage between encrypted identifiers (OHIP numbers), without revealing participant identity. The same homomorphic encryption algorithm applied to identifiers upon entry to Brain-CODE, is applied to relevant identifiers within ICES data holdings. Encrypted identifiers from Brain-CODE are securely transferred to ICES, where a comparison computation calculates differences between the encrypted sets. These differences are sent to a semi-trusted third party, who has no access to the original data, to decrypt the differences using the private key. A zero difference indicates a set of matching identifiers. One of the main challenges during testing and development of Brain-CODE Link was ensuring the software was capable of scaling to a population level, performing a large number of comparisons, in a computationally efficient manner. Conclusion/ImplicationsOngoing pilot projects within the areas of epilepsy, neurodevelopment disorders, and neurodegeneration will be the first examples of linkages between Brain-CODE and ICES. Brain-CODE Link has successfully performed several billion test comparisons, indicating its suitability to function as a scalable privacy preserving record linkage to support comprehensive analyses.


Sign in / Sign up

Export Citation Format

Share Document