Privacy-Preserving Record Linkage with Spark

Author(s):  
Onno Valkering ◽  
Adam Belloum
2021 ◽  
pp. 101935
Author(s):  
Thiago Nóbrega ◽  
Carlos Eduardo S. Pires ◽  
Dimas Cassimiro Nascimento

2021 ◽  
pp. 101959
Author(s):  
Sirintra Vaiwsri ◽  
Thilina Ranbaduge ◽  
Peter Christen ◽  
Rainer Schnell

Author(s):  
Dinusha Vatsalan ◽  
Dimitrios Karapiperis ◽  
Vassilios S. Verykios

JAMIA Open ◽  
2019 ◽  
Vol 2 (4) ◽  
pp. 562-569 ◽  
Author(s):  
Jiang Bian ◽  
Alexander Loiacono ◽  
Andrei Sura ◽  
Tonatiuh Mendoza Viramontes ◽  
Gloria Lipori ◽  
...  

Abstract

Objective: To implement an open-source tool that performs deterministic privacy-preserving record linkage (RL) in a real-world setting within a large research network.

Materials and Methods: We learned 2 efficient deterministic linkage rules using publicly available voter registration data. We then validated the performance of the 2 rules with 2 manually curated gold-standard datasets linking electronic health records and claims data from 2 sources. We developed an open-source Python-based tool, OneFL Deduper, that (1) creates seeded hash codes of combinations of patients' quasi-identifiers using a cryptographic one-way hash function to achieve privacy protection and (2) links and deduplicates patient records through a central broker by matching hash codes, with high precision and reasonable recall.

Results: We deployed the OneFL Deduper (https://github.com/ufbmi/onefl-deduper) in OneFlorida, a state-based clinical research network that is part of the national Patient-Centered Clinical Research Network (PCORnet). Using the gold-standard datasets, we achieved a precision of 97.25% to 99.7% and a recall of 75.5%. With the tool, we deduplicated approximately 3.5 million (out of approximately 15 million) records down to 1.7 million unique patients across 6 health care partners and the Florida Medicaid program. We demonstrated the benefits of RL by examining different disease profiles of the linked cohorts.

Conclusions: Many factors, including privacy risk considerations, policies and regulations, data availability and quality, and computing resources, can affect how an RL solution is constructed in a real-world setting. Nevertheless, RL is an important task for improving data quality in a network so that reliable scientific discoveries can be drawn from these massive data resources.
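The two-step approach in the abstract (seeded one-way hashes of quasi-identifier combinations, then matching of hash codes at a central broker) can be illustrated with a minimal Python sketch. This is not the OneFL Deduper code. The rule definitions, field names, normalization steps, and the choice of HMAC-SHA-256 as the seeded hash are all assumptions made for illustration; the abstract does not state which fields the two learned rules combine or which hash function the tool uses.

```python
import hashlib
import hmac
import unicodedata
from collections import defaultdict

# Hypothetical linkage rules (assumption): each rule is a tuple of
# quasi-identifier fields whose combined value is hashed.
RULES = {
    "rule_1": ("first_name", "last_name", "dob", "sex"),
    "rule_2": ("last_name", "dob", "zip"),
}


def normalize(value: str) -> str:
    """Drop accents, punctuation and case so trivial variants hash alike."""
    decomposed = unicodedata.normalize("NFKD", value)
    return "".join(ch for ch in decomposed if ch.isalnum()).lower()


def hash_record(record: dict, seed: bytes) -> dict:
    """Data-partner side: one keyed digest per rule; raw values never leave."""
    digests = {}
    for rule, fields in RULES.items():
        message = "|".join(normalize(record[f]) for f in fields)
        # HMAC-SHA-256 stands in for the "seeded" one-way hash (assumption).
        digests[rule] = hmac.new(seed, message.encode("utf-8"),
                                 hashlib.sha256).hexdigest()
    return digests


def link(hashed_records: list) -> list:
    """Broker side: cluster record ids that share a digest under any rule."""
    parent = {rid: rid for rid, _ in hashed_records}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}
    for rid, digests in hashed_records:
        for rule, digest in digests.items():
            key = (rule, digest)
            if key in first_seen:
                parent[find(rid)] = find(first_seen[key])
            else:
                first_seen[key] = rid

    clusters = defaultdict(set)
    for rid, _ in hashed_records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


# Made-up example records from two hypothetical partners.
SEED = b"shared-secret-seed"
records = {
    "a1": {"first_name": "José", "last_name": "García", "dob": "1980-01-02",
           "sex": "M", "zip": "32601"},
    "b7": {"first_name": "jose", "last_name": "GARCIA", "dob": "19800102",
           "sex": "m", "zip": "32608"},
    "c3": {"first_name": "Maria", "last_name": "Lopez", "dob": "1975-05-05",
           "sex": "F", "zip": "33101"},
}
hashed = [(rid, hash_record(rec, SEED)) for rid, rec in records.items()]
clusters = link(hashed)
```

In this sketch the broker only ever sees record identifiers and digests. Records `a1` and `b7` end up in the same cluster because their normalized values agree under the hypothetical `rule_1`, even though their ZIP codes differ and `rule_2` does not match. Exact matching on hashes tolerates no typographical variation beyond what normalization removes, which is consistent with the high-precision, lower-recall profile the abstract reports for deterministic rules.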

