linkage quality
Recently Published Documents



2022, Vol 22 (1)
Author(s): Sean Randall, Helen Wichmann, Adrian Brown, James Boyd, Tom Eitelhuber, ...

Background: Privacy-preserving record linkage (PPRL) methods using Bloom filters have shown promise for use in operational linkage settings. However, real-world evaluations are required to confirm their suitability in practice. Methods: Extracts of records from the Western Australian (WA) Hospital Morbidity Data Collection 2011–2015 and WA Death Registrations 2011–2015 were encoded to Bloom filters and then linked using privacy-preserving methods. Results were compared to a traditional, un-encoded linkage of the same datasets using the same blocking criteria, to enable direct investigation of the comparison step. The encoded linkage was carried out in a blinded setting, with no access to un-encoded data or a ‘truth set’. Results: The PPRL method using Bloom filters provided similar linkage quality to the traditional un-encoded linkage, with 99.3% of ‘groupings’ identical between privacy-preserving and clear-text linkage. Conclusion: The Bloom filter method appears suitable for use in situations where clear-text identifiers cannot be provided for linkage.
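
To make the comparison step concrete, the following is a minimal sketch of Bloom filter encoding and similarity scoring as it is commonly described for PPRL. It is not the configuration used in the study: the filter size, the number of hash functions, the bigram padding, the HMAC construction, and the key `hypothetical-shared-key` are all assumptions chosen for illustration.

```python
import hashlib
import hmac


def bigrams(value: str) -> set[str]:
    """Split a normalised string into padded character bigrams."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, size: int = 1000, num_hashes: int = 20,
                 secret: bytes = b"hypothetical-shared-key") -> frozenset[int]:
    """Return the bit positions set to 1 when the value is encoded.

    Each bigram is hashed num_hashes times with a keyed hash (HMAC-SHA256);
    every digest selects one position in a filter of `size` bits.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = f"{k}|{gram}".encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest, "big") % size)
    return frozenset(positions)


def dice(a: frozenset[int], b: frozenset[int]) -> float:
    """Dice coefficient between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


smith, smyth, jones = (bloom_encode(n) for n in ("Smith", "Smyth", "Jones"))
print(dice(smith, smyth), dice(smith, jones))
```

Because similar strings share most of their bigrams, their encodings share most of their set bits, so a linker can score candidate pairs without ever seeing the clear-text names.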


Author(s): Thiago Nóbrega, Carlos Eduardo S. Pires, Dimas Cassimiro Nascimento

Privacy-Preserving Record Linkage (PPRL) integrates private or sensitive data from several data sources held by different parties. It aims to identify records (e.g., persons or objects) that represent the same real-world entity across private data sources held by different custodians. Due to recent laws and regulations (e.g., the General Data Protection Regulation), PPRL approaches are increasingly in demand in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process must address both efficacy (linkage quality) and privacy. For instance, the PPRL process may need to be executed over data sources such as a database containing personal information from governmental income distribution and assistance programs, linking the entities accurately while protecting the privacy of the information. This work therefore intends to simplify the PPRL process so that real-world applications (such as medical, epidemiological, and population studies) require less legal and bureaucratic effort to access and process the data, making these applications more straightforward for companies and governments to execute. In this context, this work presents two major contributions to PPRL: (i) it improves linkage quality and simplifies the process by employing Machine Learning techniques to decide whether two records represent the same entity; and (ii) it enables the auditability of the computations performed during PPRL.


2021, Vol 11 (18), pp. 8417
Author(s): Robert Nowak, Wiktor Franus, Jiarui Zhang, Yue Zhu, Xin Tian, ...

We present an algorithm that matches the authors of patents with the authors of scientific articles. The authors are given as records in Scopus and the Chinese Patents Database. This issue is known as the record linkage problem, defined as finding and linking individual records from separate databases that refer to the same real-world entity. The presented solution is based on a record linkage framework combined with text feature extraction and machine learning techniques. The main challenges were low data quality, the lack of common record identifiers, and the limited number of other attributes shared by both data sources. Matching based solely on an exact comparison of authors’ names does not solve the record linkage problem because many Chinese authors share the same full name. Moreover, the English spelling of Chinese names is not standardized in the analyzed data. Three ideas for extending the attribute sets and improving record linkage quality were proposed: (1) fuzzy matching of names, (2) comparison of the abstracts of patents and articles, and (3) comparison of scientists’ main research areas, calculated using all available metadata. The presented solution was evaluated in terms of matching quality and complexity on ≈250,000 record pairs linked by human experts. The results of numerical experiments show that the proposed strategies increase the quality of record linkage compared to typical solutions.
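
As an illustration of the first idea, fuzzy matching of names, here is a small sketch using Python's standard `difflib`. It is not the authors' method: the normalisation rules (lowercasing, dropping hyphens, sorting name tokens so that surname-first and given-name-first orders compare equal) and the example names are assumptions made for this sketch.

```python
from difflib import SequenceMatcher


def normalise_name(name: str) -> str:
    """Lowercase, drop hyphens and commas, and sort tokens to ignore name order."""
    cleaned = name.lower().replace("-", "").replace(",", " ")
    return " ".join(sorted(cleaned.split()))


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise_name(a), normalise_name(b)).ratio()


# Hypothetical spellings of the same romanised name, plus an unrelated name.
print(name_similarity("Zhang, Wei", "Wei Zhang"))
print(name_similarity("Xiao-Ming Li", "Xiaoming Li"))
print(name_similarity("Wei Zhang", "Li Na"))
```

A score like this only narrows the candidates. Since many authors share the same full name, it would still need to be combined with other evidence, such as the abstract and research-area comparisons listed above.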


PLoS ONE, 2021, Vol 16 (8), pp. e0256535
Author(s): M. Loane, J. E. Given, J. Tan, A. Reid, D. Akhmedzhanova, ...

EUROCAT is a European network of population-based congenital anomaly (CA) registries. Twenty-one registries agreed to participate in the EUROlinkCAT study to determine whether reliable information on the survival of children born with a major CA between 1995 and 2014 can be obtained through linkage to national vital statistics or mortality records. Live-born children with a CA could be linked using personal identifiers either to their national vital statistics (including birth records, death records, and hospital records) or to mortality records only, depending on the data available within each region. In total, 18 of the 21 registries, with data on 192,862 children born with congenital anomalies, participated in the study. One registry was unable to obtain ethical approval to participate, and linkage was not possible for two registries due to local reasons. Eleven registries linked to vital statistics and seven registries linked to mortality records only; one of the latter had identification numbers for only 78% of cases and was therefore excluded from further analysis. Of the registries linking to vital statistics, six linked over 95% of their cases for all years and five were unable to link at least 85% of all live-born CA children in the earlier years of the study. No estimate of linkage success could be calculated for registries linking to mortality records. Irrespective of linkage method, deaths that occurred during the first week of life were over three times less likely to be linked than deaths occurring after the first week of life. Linkage to vital statistics can provide accurate estimates of the survival of children with CAs in some European countries. Bias arises when linkage is not successful, as early neonatal deaths were less likely to be linked. Linkage to mortality records only cannot be recommended, as linkage quality, and hence bias, cannot be assessed.


Agriculture, 2021, Vol 11 (8), pp. 797
Author(s): Viet Hoang

This study empirically analyzes the influence of contract farming on income and farming difficulties in Vietnam using econometric models, and theoretically identifies the mechanism by which contract farming affects income, sustainability, and welfare using a qualitative method. The empirical results show that contract farming has no significant impact on farms’ income, although it can facilitate farming activities and reduce difficulties. The head’s education, the head’s gender, the type of crop, and technology may affect farmers’ income. The mechanism by which contract farming affects income, sustainability, and welfare is theoretically proposed as follows: contract farming initially impacts intermediate factors such as cooperatives, market access, knowledge and skill, product quality, technology, and support. These factors then affect capacity, linkage, quality, and certification, which can enhance farmers’ competitiveness. In the long term, stronger competitiveness, higher prices, increasing productivity, and lower costs may significantly improve income, sustainability, and welfare. In general, contract farming may have positive impacts on income, sustainability, and welfare in the medium and long term. In the short term, the result is not significant because of prices similar to or lower than the spot market price, growing production costs, decreasing productivity, and weak contract performance. The findings may help policymakers decide how to expand contract farming and its benefits. Economic scholars can test and compare both the quantitative and qualitative findings in other contexts.


2021, Vol 19 (1)
Author(s): Florens Rohde, Martin Franke, Ziad Sehili, Martin Lablans, Erhard Rahm

Background: Data analysis for biomedical research often requires a record linkage step to identify records from multiple data sources that refer to the same person. Due to the lack of unique personal identifiers across these sources, record linkage relies on the similarity of personal data such as first and last names or birth dates. However, the exchange of such identifying data with a third party, as is the case in record linkage, is generally subject to strict privacy requirements. This problem is addressed by privacy-preserving record linkage (PPRL) and pseudonymization services. Mainzelliste is an open-source record linkage and pseudonymization service used to carry out PPRL processes in real-world use cases. Methods: We evaluate the linkage quality and performance of the linkage process using several real and near-real datasets with different properties with respect to size and error rate of matching records. We compare (plaintext) record linkage with PPRL based on encoded records (Bloom filters). Furthermore, since the Mainzelliste software offers no blocking mechanism, we extend it with phonetic blocking as well as novel blocking schemes based on locality-sensitive hashing (LSH) to improve runtime for both standard and privacy-preserving record linkage. Results: The Mainzelliste achieves high linkage quality for PPRL using field-level Bloom filters, owing to an error-tolerant matching algorithm that can handle variations in names, in particular missing or transposed name compounds. However, due to the absence of blocking, the runtimes are unacceptable for real use cases with larger datasets. The newly implemented blocking approaches improve runtimes by orders of magnitude while retaining high linkage quality. Conclusion: We conduct the first comprehensive evaluation of the record linkage facilities of the Mainzelliste software and extend it with blocking methods to improve its runtime. We observed very high linkage quality for both plaintext and encoded data, even in the presence of errors. The blocking methods improve runtime by orders of magnitude, facilitating use in research projects with large datasets and many participants.
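
The effect of LSH-based blocking can be sketched as follows. This is a generic Hamming-LSH illustration over Bloom filter bit positions, not the Mainzelliste extension itself: the function names, the number of hash tables, the key length, and the fixed seed are all assumptions. Records are compared only if they share a bucket in at least one table, which avoids scoring every possible pair.

```python
import random
from collections import defaultdict
from itertools import combinations


def lsh_keys(bits: frozenset[int], size: int, num_tables: int = 4,
             key_len: int = 8, seed: int = 42) -> list[tuple]:
    """Build one blocking key per table by sampling bit positions.

    The same seed yields the same sampled positions for every record,
    so identical filters always receive identical keys.
    """
    rng = random.Random(seed)
    keys = []
    for table in range(num_tables):
        sample = rng.sample(range(size), key_len)
        keys.append((table, tuple(pos in bits for pos in sample)))
    return keys


def candidate_pairs(records: dict[str, frozenset[int]], size: int) -> set[tuple[str, str]]:
    """Return the record ID pairs that share at least one LSH bucket."""
    buckets = defaultdict(list)
    for record_id, bits in records.items():
        for key in lsh_keys(bits, size):
            buckets[key].append(record_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy 16-bit filters: "a" and "b" are identical, "c" is their complement.
records = {
    "a": frozenset(range(0, 8)),
    "b": frozenset(range(0, 8)),
    "c": frozenset(range(8, 16)),
}
print(candidate_pairs(records, size=16))
```

Longer keys make buckets more selective and more tables recover pairs that a single key would miss. That trade-off between runtime and retained matches is what a blocking evaluation measures.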


Author(s): Josie Plachta, Charlie Tomlin, Rachel Shipsey

Introduction: Data linkage of hashed datasets is much more difficult than linking in-the-clear data. Hashing prevents the use of matching tools that overcome messy data, such as ‘contained-within’ functions and edit distance metrics. Hashing sensitive data received from third parties is becoming more common due to increased data security concerns. Institutions need to be ready to link hashed data with high accuracy, otherwise the quality of outputs from these linked datasets will suffer. Objectives and Approach: We designed an innovative matching method, Derive and Conquer (D&C). We derived variables containing substrings or patterns of the full variable (e.g. Soundex or the first 4 characters of a string) to match on instead. However, using many combinations of these derived variables would require thousands of traditional match keys to be programmed, run, and reviewed. Instead, D&C runs matchkeys on a derived agreement variable, which amalgamates the information stored in multiple derived variables into one value, reducing the number of matchkeys to a manageable amount. D&C runs on distributed computing systems using PySpark to link datasets containing millions of records in a timely manner. Results: D&C was developed using in-the-clear UK Census and health records, with results comparable to the in-the-clear gold standard. It is currently being tested on hashed data to link UK tax and benefits data to UK health records. 66.4 million records were declared matched, a realistic match rate for the UK population. Research into the linkage quality is ongoing to produce estimates of the amount of bias in the linkage and of the precision and recall. We will be excited to present these results at the Conference in October. These results will be used to improve D&C. Conclusion / Implications: Using these derived variables, we have been able to overcome the challenge of matching massive hashed datasets with a realistic match rate and in a realistic time frame.
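
The abstract does not specify how the derived agreement variable is built, so the following is only one possible reading, sketched in plain Python rather than PySpark. The field names, the choice of derived variables, the unsalted SHA-256 hashing, and the encoding of agreement as a string of 1s and 0s are all assumptions. The point it illustrates is that derivation must happen before hashing, after which agreement on several hashed derived variables can be summarised in a single value.

```python
import hashlib

# Hypothetical derived variables, in the order used by the agreement pattern.
FIELDS = ("surname_full", "surname_first4", "forename_initial", "birth_year")


def sha256_hex(value: str) -> str:
    """Hash one derived value; a real deployment would likely add a secret salt."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def derive_and_hash(record: dict[str, str]) -> dict[str, str]:
    """Derive substrings from the clear-text record, then hash each one."""
    surname = record["surname"].strip().lower()
    forename = record["forename"].strip().lower()
    return {
        "surname_full": sha256_hex(surname),
        "surname_first4": sha256_hex(surname[:4]),
        "forename_initial": sha256_hex(forename[:1]),
        "birth_year": sha256_hex(record["dob"][:4]),
    }


def agreement_pattern(a: dict[str, str], b: dict[str, str]) -> str:
    """Amalgamate per-field agreement into one value, e.g. '0111'."""
    return "".join("1" if a[f] == b[f] else "0" for f in FIELDS)


left = derive_and_hash({"surname": "Thompson", "forename": "Anna", "dob": "1980-05-17"})
right = derive_and_hash({"surname": "Thomson", "forename": "Ann", "dob": "1980-05-17"})
print(agreement_pattern(left, right))  # prints 0111
```

Under this reading, a matchkey becomes a rule over the pattern value (for example, accepting `1111` and `0111`) instead of a separately programmed combination of fields.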


Author(s): Peter Christen, Thilina Ranbaduge, Rainer Schnell

Introduction: The linking of sensitive databases containing personal identifying information across organisations is an increasingly important task in application domains ranging from health and social science research to national censuses. Various techniques have been developed to facilitate the linking of sensitive databases while at the same time preserving the privacy of the individuals represented in these databases. Objectives and Approach: We present several case studies where the privacy-preserving linking of sensitive databases is crucial, and then discuss the advantages and limitations of existing algorithms and techniques for linking sensitive databases. We discuss privacy techniques such as Bloom filter encoding, hashing, and secure multi-party computation from the point of view of a linkage practitioner. We highlight the aspects that are important when selecting or implementing a privacy-preserving linkage technique within practical applications. Results: Conceptually, linkage techniques can be evaluated across three main dimensions: linkage quality, scalability to linking large or multiple databases, and the privacy protection provided by a technique. From a practical perspective, however, several other dimensions are crucial, including the availability of software or ease of implementation, the technical knowledge available in an organisation, and the suitability of techniques for a given linkage scenario. Our analysis of a diverse range of linkage techniques has shown that currently no technique provides an adequate solution along all conceptual as well as all practical dimensions. Conclusions: More research is required to develop novel techniques that facilitate the privacy-preserving linkage of large sensitive databases across organisations, including new encoding methods and cryptanalysis attacks (where until now most attacks have neglected the attack vectors that likely occur in practice), and novel evaluation measures to assess the privacy provided by linkage techniques. We encourage practitioners to be aware of the identified limitations – as well as the opportunities – of existing privacy-preserving linkage techniques and to carefully assess the technical and organisational requirements of such techniques within their institution.
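
Of the three conceptual dimensions, linkage quality is the one most often reported numerically, typically as precision, recall, and their harmonic mean (F-measure) against a set of known true matches. The sketch below shows that calculation. The representation of a link as an unordered pair of record IDs and the example IDs are assumptions for illustration.

```python
def linkage_quality(predicted: set[frozenset[str]],
                    truth: set[frozenset[str]]) -> dict[str, float]:
    """Compute precision, recall, and F-measure for a set of predicted links.

    Each link is a frozenset of two record IDs, so (a, b) equals (b, a).
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


def link(a: str, b: str) -> frozenset[str]:
    """Helper that builds an unordered link between two record IDs."""
    return frozenset((a, b))


# Hypothetical links: two of three predictions are correct, two true links are missed.
predicted = {link("h1", "d1"), link("h2", "d2"), link("h3", "d9")}
truth = {link("h1", "d1"), link("d2", "h2"), link("h4", "d4"), link("h5", "d5")}
print(linkage_quality(predicted, truth))
```

In a blinded PPRL setting no truth set is available to the linker, which is why such measures usually have to be computed afterwards by a party with access to a clear-text reference linkage.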


Author(s): Nick Von Sanden

Introduction: Linkage of Federal Government data in Australia is conducted primarily through Accredited Integrating Authorities (AIAs). These agencies hold different datasets from Commonwealth and state/territory government agencies. Historically, linkage projects involving data held by different AIAs have been inefficient, requiring the transfer of identifiable data between agencies and the relinking of data that have already been linked by another agency. Objectives and Approach: Two AIAs (the AIHW and ABS) have developed a system of interoperable linkage spines to address this issue. By using common datasets as a base, the agencies have improved the efficiency and security of linkage projects. This process was developed through an analysis of spine datasets and two test projects to share data between the agencies. Results: The two test projects successfully linked cross-portfolio and cross-jurisdictional data without the need to share additional identifying information between the AIAs. Preliminary results suggest a high linkage rate from this process, and work is underway to quantify the linkage quality compared to traditional linkage methodologies. The ABS and AIHW are also investigating the implications for linkage quality as more datasets are included in the agencies’ linkage spines. Conclusion / Implications: The success of this project will increase the efficiency of cross-jurisdictional and cross-portfolio linkage in Australia. It will also allow specialised AIAs to work on datasets where they have specific expertise and to feed these into broader projects. This is expected to have an additional impact on public trust in the linkage system, by minimising the sharing of personally identifiable information while still maintaining high-quality linkage.


Author(s): Sean Randall, Adrian Brown, Anna Ferrante, James Boyd, Katie Irvine, ...

Introduction: While the quantity and types of datasets used by data linkage projects are growing, there remain some datasets that are ‘not available’ or ‘hard to access’ for researchers and linkers, either due to legal/regulatory constraints restricting the release of personally identifying information or because of privacy or reputational concerns. Advances in privacy-preserving record linkage methods (e.g. PPRL-Bloom) have made it possible to overcome this impasse. These techniques aim to provide strong privacy protection while still maintaining high linkage quality. PPRL-Bloom methods are being used in practice. The Centre for Data Linkage (CDL) at Curtin University has been involved in several PPRL linkage and evaluation projects using real-world data. As the methods are relatively new, published information on achievable linkage quality in real-world scenarios is limited. Objectives and Approach: We present and describe several real-world applications of privacy-preserving record linkage (PPRL-Bloom) where the quality of the linkage could be ascertained. In each case, data was linked ‘blind’; that is, without linkers having access to the original personal identifiers at any stage, or having any additional information about the records. Evaluations include a linkage of state-based morbidity and mortality records, a linkage of a number of general practice datasets to morbidity and emergency records, and a linkage of a range of state-based non-health administrative data, including education, police, housing, birth and child protection records. Results: The privacy-preserving record linkage performed admirably, with very high-quality results across all evaluations. Conclusion / Implications: Privacy-preserving linkage is a useful and innovative methodology that is currently being used in real-world projects. The results of these evaluations suggest it can be an appropriate linkage tool when legal or other constraints block the release of personally identifying information to third-party linkage units.

