Overcoming the Impasse 2: Assessing the Quality of Recent Australian Applications of a Privacy-Preserving Record Linkage Method (PPRL-BLOOM)

Author(s):  
Sean Randall ◽  
Adrian Brown ◽  
Anna Ferrante ◽  
James Boyd ◽  
Katie Irvine ◽  
...  

Introduction While the quantity and type of datasets used by data linkage projects are growing, some datasets remain ‘not available’ or ‘hard to access’ for researchers and linkers, either because legal or regulatory constraints restrict the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and Approach We present and describe several real-world applications of privacy-preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data were linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. Results The privacy-preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / Implications Privacy-preserving linkage is a useful and innovative methodology that is currently being used in real-world projects. The results of these evaluations suggest it can be an appropriate linkage tool when legal or other constraints block release of personally identifying information to third-party linkage units.
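As a rough illustration of the Bloom filter idea behind PPRL-Bloom, the sketch below encodes a name's character bigrams into a set of bit positions using a keyed hash and compares two encodings with the Dice coefficient. The filter length (1000), the number of hash functions (20), the underscore padding, and the use of HMAC-SHA256 are all assumptions chosen for illustration; the abstract does not state the parameters used in the evaluated projects.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a normalised, padded string into overlapping character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: bytes, size: int = 1000,
                 num_hashes: int = 20) -> set[int]:
    """Return the set of bit positions switched on for this value.

    Each bigram is hashed num_hashes times with a keyed hash (HMAC-SHA256,
    an illustrative choice) so that only parties holding the secret can
    produce comparable encodings.
    """
    positions: set[int] = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}:{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"shared-secret"  # hypothetical key agreed by the data custodians
    smith = bloom_encode("Smith", key)
    print("Smith vs Smyth:", round(dice(smith, bloom_encode("Smyth", key)), 3))
    print("Smith vs Jones:", round(dice(smith, bloom_encode("Jones", key)), 3))
```

Because similar strings share most of their bigrams, their encodings share most of their bit positions, which is what lets a linker compare records approximately without seeing the underlying identifiers.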

Author(s):  
Thilina Ranbaduge ◽  
Dinusha Vatsalan ◽  
Sean Randall ◽  
Peter Christen

ABSTRACT Objective The linking of multiple (three or more) health databases is challenging because of the increasing sizes of databases, the number of parties among which they are to be linked, and privacy concerns related to the use of personal data such as names, addresses, or dates of birth. This entails a need to develop advanced scalable techniques for linking multiple databases while preserving the privacy of the individuals they contain. In this study we empirically evaluate several state-of-the-art multi-party privacy-preserving record linkage (MP-PPRL) techniques with large real-world health databases from Australia. Approach MP-PPRL is conducted such that no sensitive information is revealed about database records that can be used to infer knowledge about individuals or groups of individuals. Current state-of-the-art methods used in this evaluation use Bloom filters to encode personal identifying information. The empirical evaluation comprises different multi-party private blocking and matching techniques that are evaluated for different numbers of parties. Each database contains more than 700,000 records extracted from ten years of New South Wales (NSW) emergency presentation data. Each technique is evaluated with regard to scalability, quality and privacy. Scalability and quality are measured using the metrics of reduction ratio, pairs completeness, precision, recall, and F-measure. Privacy is measured using disclosure risk metrics that are based on the probability of suspicion, defined as the likelihood that a record in an encoded database matches one or more records in a publicly available database such as a telephone directory. Both MP-PPRL techniques that use a trusted linkage unit and those that do not are evaluated. Results Experimental results showed that MP-PPRL methods are practical for linking large-scale real-world data. Private blocking techniques achieved significantly higher privacy than standard hashing-based techniques (maximum disclosure risk of 0.0003 versus 1, respectively), at a small cost to linkage quality and efficiency. Similarly, private matching techniques showed an acceptable reduction in linkage quality compared with standard non-private matching while providing high privacy protection. Conclusion The adoption of privacy-preserving linkage methods has the ability to significantly reduce privacy risks associated with linking large health databases, and to enable the data linkage community to offer operational linkage services not previously possible. The evaluation results show that these state-of-the-art MP-PPRL techniques are scalable in terms of database sizes and number of parties, while providing significantly improved privacy with an associated trade-off in linkage quality compared to standard linkage techniques.
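The scalability and quality metrics named in this abstract have standard definitions, which the following minimal sketch computes for the simplified two-database case (the study itself is multi-party). The record pairs and database sizes are made-up toy values, and the function names are illustrative rather than taken from any evaluated tool.

```python
def reduction_ratio(num_candidates: int, size_a: int, size_b: int) -> float:
    """Fraction of the full comparison space that blocking removed."""
    return 1.0 - num_candidates / (size_a * size_b)


def pairs_completeness(candidates: set, true_matches: set) -> float:
    """Fraction of true matching pairs that survived blocking."""
    return len(candidates & true_matches) / len(true_matches)


def precision_recall_f(predicted: set, true_matches: set) -> tuple[float, float, float]:
    """Precision, recall and F-measure of the pairs classified as matches."""
    true_positives = len(predicted & true_matches)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true_matches) if true_matches else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy data: database A has 4 records, database B has 5 records.
    true_matches = {(0, 0), (1, 1), (2, 2), (3, 3)}
    candidates = {(0, 0), (1, 1), (2, 2), (3, 4), (0, 1)}   # pairs kept by blocking
    predicted = {(0, 0), (1, 1), (3, 4)}                    # pairs classified as matches

    print(reduction_ratio(len(candidates), 4, 5))           # 0.75
    print(pairs_completeness(candidates, true_matches))     # 0.75
    print(precision_recall_f(predicted, true_matches))
```

Reduction ratio and pairs completeness describe the blocking step, while precision, recall and F-measure describe the final match decisions, so the two groups are usually reported together.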


Author(s):  
James Boyd ◽  
Anna Ferrante ◽  
Adrian Brown ◽  
Sean Randall ◽  
James Semmens

ABSTRACT Objectives While record linkage has become a strategic research priority within Australia and internationally, legal and administrative issues prevent data linkage in some situations due to privacy concerns. Even current best practices in record linkage carry some privacy risk as they require the release of personally identifying information to trusted third parties. Application of record linkage systems that do not require the release of personal information can overcome legal and privacy issues surrounding data integration. Current conceptual and experimental privacy-preserving record linkage (PPRL) models show promise in addressing data integration challenges but do not yet address all of the requirements for real-world operations. This paper aims to identify and address some of the challenges of operationalising PPRL frameworks. Approach Traditional linkage processes involve comparing personally identifying information (name, address, date of birth) on pairs of records to determine whether the records belong to the same person. Designing appropriate linkage strategies is an important part of the process. These are typically based on the analysis of data attributes (metadata) such as data completeness, consistency, constancy and field discriminating power. Under a PPRL model, however, these factors cannot be discerned from the encrypted data, so an alternative approach is required. This paper explores methods for data profiling, blocking, weight/threshold estimation and error detection within a PPRL framework. Results Probabilistic record linkage typically involves the estimation of weights and thresholds to optimise the linkage and ensure highly accurate results. The paper outlines the metadata requirements and automated methods necessary to collect data without compromising privacy. We present work undertaken to develop parameter estimation methods which can help optimise a linkage strategy without the release of personally identifiable information. These are required in all parts of the privacy-preserving record linkage process (pre-processing, standardising activities, linkage, grouping and extracting). Conclusions PPRL techniques that operate on encrypted data have the potential for large-scale record linkage, performing both accurately and efficiently under experimental conditions. Our research has advanced the current state of PPRL with a framework for secure record linkage that can be implemented to improve and expand linkage service delivery while protecting an individual’s privacy. However, more research is required to supplement this technique with additional elements to ensure the end-to-end method is practical and can be incorporated into real-world models.
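To make the role of weights and thresholds concrete, here is a minimal sketch of the standard probabilistic (Fellegi-Sunter style) scoring that such parameter estimation feeds into. The m- and u-probabilities and the two thresholds below are invented for illustration; the abstract does not give values, and how they are estimated without access to identifiers is the subject of the paper, not of this sketch.

```python
import math


def field_weights(m: float, u: float) -> tuple[float, float]:
    """Agreement and disagreement weights for one field.

    m is the probability that the field agrees for a true match,
    u the probability that it agrees for a non-match.
    """
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def pair_score(agreements: dict[str, bool],
               params: dict[str, tuple[float, float]]) -> float:
    """Sum the per-field weights for one compared record pair."""
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(*params[field])
        total += agree_w if agrees else disagree_w
    return total


def classify(score: float, lower: float, upper: float) -> str:
    """Two-threshold decision: link, non-link, or send for review."""
    if score >= upper:
        return "link"
    if score <= lower:
        return "non-link"
    return "review"


if __name__ == "__main__":
    # Hypothetical (m, u) values for three fields.
    params = {"surname": (0.95, 0.01), "dob": (0.97, 0.003), "sex": (0.99, 0.5)}
    score = pair_score({"surname": True, "dob": True, "sex": False}, params)
    print(round(score, 2), classify(score, lower=0.0, upper=8.0))
```

Fields with high discriminating power (small u) contribute large agreement weights, which is why the abstract lists field discriminating power among the metadata a linkage strategy depends on.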


2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Florens Rohde ◽  
Martin Franke ◽  
Ziad Sehili ◽  
Martin Lablans ◽  
Erhard Rahm

Abstract Background Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources referring to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties w.r.t. size and error-rate of matching records. We conduct a comparison between (plaintext) record linkage and PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it by phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters due to the use of an error-tolerant matching algorithm that can handle variances in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. 
We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The added blocking methods improve runtime by orders of magnitude, facilitating use in research projects with large datasets and many participants.
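The LSH-based blocking mentioned above can be illustrated with a generic Hamming LSH sketch over Bloom filters: each hash table samples a few bit positions, and records whose filters agree on all sampled bits land in the same block. This is not the code of the Mainzelliste extension; the function name, the number of tables, and the key length are assumptions for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations


def hamming_lsh_candidates(filters: dict[str, set[int]], size: int,
                           num_tables: int = 10, bits_per_key: int = 8,
                           seed: int = 0) -> set[tuple[str, str]]:
    """Return candidate record pairs that share a block in at least one table.

    filters maps a record id to the set of 1-bit positions of its Bloom
    filter; size is the Bloom filter length in bits.
    """
    rng = random.Random(seed)
    candidates: set[tuple[str, str]] = set()
    for _ in range(num_tables):
        # Each table uses its own random sample of bit positions as the key.
        sampled = rng.sample(range(size), bits_per_key)
        buckets: dict[tuple[bool, ...], list[str]] = defaultdict(list)
        for record_id, bits in filters.items():
            key = tuple(position in bits for position in sampled)
            buckets[key].append(record_id)
        # Only records in the same bucket are compared later.
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


if __name__ == "__main__":
    toy = {
        "a": set(range(0, 50)),     # identical to "b"
        "b": set(range(0, 50)),
        "c": set(range(50, 100)),   # bitwise complement of "a" and "b"
    }
    print(hamming_lsh_candidates(toy, size=100))  # {('a', 'b')}
```

More tables raise the chance that similar filters share a block (better pairs completeness), while longer keys make blocks smaller (better reduction ratio), so the two parameters trade quality against runtime.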


Author(s):  
Dinusha Vatsalan ◽  
Peter Christen ◽  
Christine M. O'Keefe ◽  
Vassilios S. Verykios

Privacy-preserving record linkage (PPRL) addresses the problem of identifying matching records from different databases that correspond to the same real-world entities using quasi-identifying attributes (in the absence of unique entity identifiers), while preserving privacy of these entities. Privacy is being preserved by not revealing any information that could be used to infer the actual values about the records that are not reconciled to the same entity (non-matches), and any confidential or sensitive information (that is not agreed upon by the data custodians) about the records that were reconciled to the same entity (matches) during or after the linkage process. The PPRL process often involves three main challenges, which are scalability to large databases, high linkage quality in the presence of data quality errors, and sufficient privacy guarantees. While many solutions have been developed for the PPRL problem over the past two decades, an evaluation and comparison framework of PPRL solutions with standard numerical measures defined for all three properties (scalability, linkage quality, and privacy) of PPRL has so far not been presented in the literature. We propose a general framework with normalized measures to practically evaluate and compare PPRL solutions in the face of linkage attack methods that are based on an external global dataset. We conducted experiments of several existing PPRL solutions on real-world databases using our proposed evaluation framework, and the results show that our framework provides an extensive and comparative evaluation of PPRL solutions in terms of the three properties.


Author(s):  
Thiago Nóbrega ◽  
Carlos Eduardo S. Pires ◽  
Dimas Cassimiro Nascimento

Privacy-Preserving Record Linkage (PPRL) intends to integrate private/sensitive data from several data sources held by different parties. It aims to identify records (e.g., persons or objects) representing the same real-world entity over private data sources held by different custodians. Due to recent laws and regulations (e.g., General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process needs to deal with both efficacy (linkage quality) and privacy problems. For instance, the PPRL process needs to be executed over data sources (e.g., a database containing personal information of governmental income distribution and assistance programs) with an accurate linkage of the entities while, at the same time, protecting the privacy of the information. Thus, this work intends to simplify the PPRL process for real-world applications (such as medical, epidemiologic, and populational studies) by reducing the legal and bureaucratic effort needed to access and process the data, making these applications more straightforward for companies and governments to execute. In this context, this work presents two major contributions to PPRL: i) improving linkage quality and simplifying the process by employing Machine Learning techniques to decide whether two records represent the same entity; and ii) enabling the auditability of the computations performed during PPRL.


Author(s):  
Anna Ferrante ◽  
James Boyd ◽  
Sean Randall ◽  
Adrian Brown ◽  
James Semmens

ABSTRACT Objectives Record linkage is a powerful technique which transforms discrete episode data into longitudinal person-based records. These records enable the construction and analysis of complex pathways of health and disease progression, and service use. Achieving high linkage quality is essential for ensuring the quality and integrity of research based on linked data. The methods used to assess linkage quality will depend on the volume and characteristics of the datasets involved, the processes used for linkage and the additional information available for quality assessment. This paper proposes and evaluates two methods to routinely assess linkage quality. Approach Linkage units currently use a range of methods to measure, monitor and improve linkage quality; however, no common approach or standards exist. There is an urgent need to develop “best practices” in evaluating, reporting and benchmarking linkage quality. In assessing linkage quality, the primary interest is in knowing the number of true matches and non-matches identified as links and non-links. Any misclassification of matches within these groups introduces linkage errors. We present efforts to develop sharable methods to measure linkage quality in Australia. These include a sampling-based method to estimate both precision (accuracy) and recall (sensitivity) following record linkage, and a benchmarking method: a transparent and transportable methodology to benchmark the quality of linkages across different operational environments. Results The sampling-based method achieved estimates of linkage quality that were very close to actual linkage quality metrics. This method presents as a feasible means of accurately estimating matching quality and refining linkages in population-level linkage studies. The benchmarking method provides a systematic approach to estimating linkage quality with a set of open and shareable datasets and a set of well-defined, established performance metrics. The method provides an opportunity to benchmark the linkage quality of different record linkage operations. Both methods have the potential to assess the inter-rater reliability of clerical reviews. Conclusions Both methods produce reliable estimates of linkage quality, enabling the exchange of information within and between linkage communities. It is important that researchers can assess risk in studies using record linkage techniques. Understanding the impact of linkage quality on research outputs highlights a need for standard methods to routinely measure linkage quality. These two methods provide a good start to the quality process, but it is important to identify standards and good practices in all parts of the linkage process (pre-processing, standardising activities, linkage, grouping and extracting).


Author(s):  
Nadine Wiggins ◽  
Brian Stokes

ABSTRACT Objectives The Tasmanian Data Linkage Unit (TDLU) was established through the University of Tasmania in 2011, with the first dataset imported to its Master Linkage Map (MLM) during 2014. Tasmania, an island state of Australia, has a population of approximately 516,000. From the TDLU’s inception, it was deemed important to build a high-quality linkage spine comprising key administrative data representative of significant state health and related datasets to support quality population-level research. Approach The TDLU has embraced a model of continual quality and process enhancement as a determined strategy to support ongoing business improvement. Initial linkage approaches utilised ‘traditional’ methods of reviewing record pairs within an upper and lower confidence range. This approach resulted in false record pairs with high confidence levels being linked (false positives) and true record pairs at lower confidence levels not being linked (false negatives). To improve linkage quality, the TDLU has continually refined and modified its clerical review methodology, with a specialist software module developed to identify specific record attributes within groups that require the group to be manually reviewed and resolved. A range of SQL queries has also been developed to identify incorrect links and further enhance the linkage quality of the MLM. Results The linkage quality tools implemented have led to improved clerical review and quality assurance processes, which in turn have increased the overall quality of the linkage spine. The ‘targeted’ method of clerical review provides easy identification of false positive records, particularly those with high confidence scores such as twins and husband/wife combinations. The review of groups at lower confidence levels has minimised the rate of false negative pairs; however, further refinement of tools is required to minimise the time spent on reviewing these groups. The clerical review software module has equipped staff with the necessary information to make informed and timely decisions when reviewing groups of records. Detailed documentation is maintained for each linkage project, providing continual feedback for system and process improvements as the linkage spine increases in size. Conclusion The process of clerical review and quality assurance requires a commitment to continual refinement of tools and techniques, resulting in a higher-quality linkage spine and a reduction in the total time and resources required to link datasets.


JAMIA Open ◽  
2019 ◽  
Vol 2 (4) ◽  
pp. 562-569 ◽  
Author(s):  
Jiang Bian ◽  
Alexander Loiacono ◽  
Andrei Sura ◽  
Tonatiuh Mendoza Viramontes ◽  
Gloria Lipori ◽  
...  

Abstract Objective To implement an open-source tool that performs deterministic privacy-preserving record linkage (RL) in a real-world setting within a large research network. Materials and Methods We learned 2 efficient deterministic linkage rules using publicly available voter registration data. We then validated the 2 rules’ performance with 2 manually curated gold-standard datasets linking electronic health records and claims data from 2 sources. We developed an open-source Python-based tool, OneFL Deduper, that (1) creates seeded hash codes of combinations of patients’ quasi-identifiers using a cryptographic one-way hash function to achieve privacy protection and (2) links and deduplicates patient records using a central broker through matching of hash codes with a high precision and reasonable recall. Results We deployed the OneFL Deduper (https://github.com/ufbmi/onefl-deduper) in OneFlorida, a state-based clinical research network that is part of the national Patient-Centered Clinical Research Network (PCORnet). Using the gold-standard datasets, we achieved a precision of 97.25∼99.7% and a recall of 75.5%. With the tool, we deduplicated ∼3.5 million (out of ∼15 million) records down to 1.7 million unique patients across 6 health care partners and the Florida Medicaid program. We demonstrated the benefits of RL through examining different disease profiles of the linked cohorts. Conclusions Many factors, including privacy risk considerations, policies and regulations, data availability and quality, and computing resources, can impact how an RL solution is constructed in a real-world setting. Nevertheless, RL is a significant task in improving the data quality in a network so that we can draw reliable scientific discoveries from these massive data resources.
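The two-step design described above (partners hash, a broker matches) can be sketched as follows. This is not the OneFL Deduper code: the field combination (first name, last name, date of birth), the normalisation, the seed value, and the function names are hypothetical, and SHA-256 stands in for whichever one-way hash the tool uses.

```python
import hashlib
from collections import defaultdict


def normalize(value: str) -> str:
    """Lower-case and remove all whitespace so trivial variants hash alike."""
    return "".join(value.lower().split())


def hash_code(record: dict[str, str], fields: tuple[str, ...], seed: str) -> str:
    """Partner side: seeded one-way hash of a combination of quasi-identifiers."""
    joined = "|".join(normalize(record[field]) for field in fields)
    return hashlib.sha256((seed + joined).encode("utf-8")).hexdigest()


def link_by_hash(hashed: list[tuple[str, str]]) -> list[list[str]]:
    """Broker side: group record ids whose hash codes match exactly.

    The broker only ever sees (record id, hash code) pairs.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for record_id, code in hashed:
        groups[code].append(record_id)
    return list(groups.values())


if __name__ == "__main__":
    rule = ("first", "last", "dob")   # hypothetical linkage rule
    seed = "network-wide-secret"      # hypothetical shared seed
    records = [
        {"id": "siteA-1", "first": "Ann", "last": "Lee", "dob": "1980-02-03"},
        {"id": "siteB-7", "first": " ann", "last": "LEE ", "dob": "1980-02-03"},
        {"id": "siteB-8", "first": "Bob", "last": "Ray", "dob": "1975-11-30"},
    ]
    hashed = [(r["id"], hash_code(r, rule, seed)) for r in records]
    print(link_by_hash(hashed))  # [['siteA-1', 'siteB-7'], ['siteB-8']]
```

Exact hash matching tolerates no typographical variation beyond what normalisation removes, which is consistent with a deterministic approach reporting high precision but lower recall.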


Sensors ◽  
2020 ◽  
Vol 20 (16) ◽  
pp. 4651
Author(s):  
Yuanbo Cui ◽  
Fei Gao ◽  
Wenmin Li ◽  
Yijie Shi ◽  
Hua Zhang ◽  
...  

Location-Based Services (LBSs) are playing an increasingly important role in people’s daily activities. While enjoying the convenience provided by LBSs, users may lose privacy since they report their personal information to the untrusted LBS server. Although many approaches have been proposed to preserve users’ privacy, most of them focus only on the user’s location privacy and do not consider query privacy. Moreover, many existing approaches rely heavily on a trusted third-party (TTP) server, which may suffer from a single point of failure. To solve the problems above, in this paper we propose a Cache-Based Privacy-Preserving (CBPP) solution for users in LBSs. Unlike previous approaches, the proposed CBPP solution protects location privacy and query privacy simultaneously, while avoiding the TTP server problem by having users collaborate with each other in a mobile peer-to-peer (P2P) environment. In the CBPP solution, each user keeps a buffer in his mobile device (e.g., smartphone) to record service data and acts as a micro TTP server. When a user needs LBSs, he first sends a query to his neighbors to seek an answer. The user only contacts the LBS server when he cannot obtain the required service data from his neighbors. In this way, the user reduces the number of queries sent to the LBS server. We argue that the fewer queries are submitted to the LBS server, the less the user’s privacy is exposed. For users who have to send live queries to the LBS server, we employ l-diversity, a powerful privacy protection definition that can guarantee the user’s privacy against attackers using background knowledge, to further protect their privacy. Evaluation results show that the proposed CBPP solution can effectively protect users’ location and query privacy with a lower communication cost and better quality of service.


Author(s):  
Sebastian Stammler ◽  
Tobias Kussel ◽  
Phillipp Schoppmann ◽  
Florian Stampe ◽  
Galina Tremper ◽  
...  

Abstract Motivation Record Linkage has versatile applications in real-world data analysis contexts, where several data sets need to be linked on the record level in the absence of any exact identifier connecting related records. One example is medical databases of patients, spread across institutions, that have to be linked on personally identifiable entries like name, date of birth or ZIP code. At the same time, privacy laws may prohibit the exchange of this personally identifiable information (PII) across institutional boundaries, ruling out the outsourcing of the record linkage task to a trusted third party. We propose to employ privacy-preserving record linkage (PPRL) techniques that prevent, to various degrees, the leakage of PII while still allowing for the linkage of related records. Results We develop a framework for fault-tolerant PPRL using secure multi-party computation with the medical record keeping software Mainzelliste as the data source. Our solution does not rely on any trusted third party and all PII is guaranteed to not leak under common cryptographic security assumptions. Benchmarks show the feasibility of our approach in realistic networking settings: linkage of a patient record against a database of 10,000 records can be done in 48s over a heavily delayed (100ms) network connection, or 3.9s with a low-latency connection. Availability and implementation The source code of the sMPC node is freely available on GitHub at https://github.com/medicalinformatics/SecureEpilinker subject to the AGPLv3 license. The source code of the modified Mainzelliste is available at https://github.com/medicalinformatics/MainzellisteSEL.

