A primer on probabilistic record linkage

Reclink: an application for database linkage implementing the probabilistic record linkage method

Cadernos de Saúde Pública ◽

10.1590/s0102-311x2000000200014 ◽

2000 ◽

Vol 16 (2) ◽

pp. 439-447 ◽

Cited By ~ 85

Author(s):

Kenneth R. de Camargo Jr. ◽

Cláudia M. Coeli

Keyword(s):

Record Linkage ◽

Probabilistic Record Linkage ◽

A Performance

This paper presents a database linkage system based on the probabilistic record linkage technique, developed in C++ with the Borland C++ Builder version 3.0 programming environment. The system was tested on data sources of different sizes and evaluated for processing time and for sensitivity in identifying true pairs. Processing the records took less time with the program than manually, especially for larger databases. The sensitivities of the manual and automatic processes were equivalent for databases with fewer records; however, as the databases grew, sensitivity tended to decrease in the manual process only. Although still at an early stage of development, the system performed well in both speed and sensitivity. Although the performance of the algorithms used was satisfactory, the aim is to evaluate other routines in order to improve the system's performance.


HIV/AIDS coinfection with the hepatitis B and C viruses in Brazil

Cadernos de Saúde Pública ◽

10.1590/0102-311x00010413 ◽

2014 ◽

Vol 30 (2) ◽

pp. 433-438 ◽

Cited By ~ 14

Author(s):

Silvano Barbosa de Oliveira ◽

Edgar Merchan-Hamann ◽

Leila Denise Alves Ferreira Amorim

Keyword(s):

Hepatitis B ◽

Record Linkage ◽

Ministry Of Health ◽

Aids Patients ◽

Probabilistic Record Linkage ◽

Epidemiologic Surveillance ◽

Hiv Virus ◽

Epidemiological Profile ◽

Hiv Aids

The aim of this study is to estimate the prevalence of HIV/HBV and HIV/HCV coinfections among AIDS cases reported in Brazil, and to describe the epidemiological profile of these cases. Coinfection was identified through probabilistic record linkage of the data of all patients carrying the HIV virus recorded as AIDS patients and of those patients reported as carriers of hepatitis B or C virus in various databases from the Brazilian Ministry of Health from 1999 to 2010. In this period 370,672 AIDS cases were reported, of which 3,724 were HIV/HBV coinfections. Women are less likely to become coinfected than men and the chance of coinfection increases with age. This study allowed an important evaluation of HBV/HIV and HCV/HIV coinfections in Brazil using information obtained via merging secondary databases from the Ministry of Health, without conducting seroprevalence research. The findings of this study might be important for planning activities of the Brazilian epidemiologic surveillance agencies.


Record Linkage Methodology for the Social Data Linkage Environment at Statistics Canada

International Journal for Population Data Science ◽

10.23889/ijpds.v1i1.49 ◽

2017 ◽

Vol 1 (1) ◽

Author(s):

Colin Babyak ◽

Abdelnasser Saidi

Keyword(s):

Computational Complexity ◽

Record Linkage ◽

False Positive ◽

Linked Data ◽

Data Linkage ◽

Social Data ◽

Probabilistic Record Linkage ◽

The Social ◽

Data Source ◽

Statistics Canada

Abstract. Objectives: The objectives of this talk are to introduce Statistics Canada’s Social Data Linkage Environment (SDLE) and to explain the methodology behind the creation of the central depository and how both deterministic and probabilistic record linkage techniques are used to maintain and expand the environment. Approach: We will start with a brief overview of the SDLE and then continue with a discussion of how both deterministic linkages and probabilistic linkages (using Statistics Canada’s generalized record linkage software, G-Link) have been combined to create and maintain a very large central depository, which can in turn be linked to virtually any social data source for the ultimate end goal of analysis. Results: Although Canada has a population of about 36 million people, the central depository contains some 300 million records to represent them, due to multiple addresses, names, etc. Although this allows for a significant reduction in missing links, it raises the spectre of additional false positive matches and has added computational complexity which we have had to overcome. Conclusion: The combination of deterministic and probabilistic record linkage strategies has been effective in creating the central depository for the SDLE. As more and more data are linked to the environment and we continue to refine our methodology, we can now move on to the ultimate goal of the SDLE, which is to analyze this vast wealth of linked data.


Probabilistic record linkage

International Journal of Epidemiology ◽

10.1093/ije/dyv322 ◽

2015 ◽

Vol 45 (3) ◽

pp. 954-964 ◽

Cited By ~ 57

Author(s):

Adrian Sayers ◽

Yoav Ben-Shlomo ◽

Ashley W Blom ◽

Fiona Steele

Keyword(s):

Record Linkage ◽

Probabilistic Record Linkage


Field Weights Computation for Probabilistic Record Linkage in Presence of Missing Data

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001420590466 ◽

2020 ◽

Vol 34 (14) ◽

pp. 2059046

Author(s):

Yinghao Zhang ◽

Senlin Xu ◽

Mingfan Zheng ◽

Xinran Li

Keyword(s):

Missing Data ◽

Record Linkage ◽

Data Sources ◽

Computational Time ◽

Missing Value ◽

Probabilistic Record Linkage ◽

Linkage Method ◽

Field Similarity ◽

Record Pair

Record linkage is the task of identifying which records refer to the same entity. When records in different data sources have no common key and contain typographical errors in their identifier fields, the extended Fellegi–Sunter probabilistic record linkage method with consideration of field similarity, proposed by Winkler, is to our knowledge one of the most effective methods for performing record linkage. However, this method cannot efficiently handle missing values in the fields: an inappropriate weight is assigned to a record pair containing missing data. Therefore, to improve the performance of Winkler’s probabilistic record linkage method in the presence of missing values, we propose a solution that adjusts a record pair’s weight when missing data occur, which improves the accuracy of Winkler’s record linkage decisions without substantially increasing computational time.
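For readers new to the weighting scheme this abstract refers to, the following is a minimal sketch of basic Fellegi–Sunter agreement and disagreement weights. It is not the adjustment proposed in the paper: the field names, the m- and u-probabilities, and the choice to give a missing value a neutral weight of zero are all illustrative assumptions.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from any paper):
# m = P(fields agree | records are a true match)
# u = P(fields agree | records are not a match)
FIELDS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "city": (0.90, 0.10),
}


def field_weight(m, u, outcome):
    """Log2 likelihood-ratio weight for one field comparison.

    outcome is True (agree), False (disagree) or None (a value is missing).
    """
    if outcome is None:
        # Assumption: a missing value carries no evidence either way.
        return 0.0
    if outcome:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def pair_weight(rec_a, rec_b):
    """Sum the field weights for a candidate record pair."""
    total = 0.0
    for name, (m, u) in FIELDS.items():
        a, b = rec_a.get(name), rec_b.get(name)
        outcome = None if a is None or b is None else (a == b)
        total += field_weight(m, u, outcome)
    return total
```

Under these assumptions, a pair that agrees on every field scores highest, a pair with one missing field scores lower, and a pair that disagrees on that field scores lower still. Treating missing values as neutral is only one simple convention; the abstract's point is that the weight given to such pairs needs more careful handling.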


Comparison of Public-Domain Software and Services For Probabilistic Record Linkage and Address Standardization

Towards Integrative Machine Learning and Knowledge Extraction - Lecture Notes in Computer Science ◽

10.1007/978-3-319-69775-8_3 ◽

2017 ◽

pp. 51-66

Author(s):

Sou-Cheng T. Choi ◽

Yongheng Lin ◽

Edward Mulrow

Keyword(s):

Record Linkage ◽

Public Domain ◽

Probabilistic Record Linkage ◽

Public Domain Software


An Introduction to Probabilistic Record Linkage with a Focus on Linkage Processing for WTC Registries

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph17186937 ◽

2020 ◽

Vol 17 (18) ◽

pp. 6937

Author(s):

Jana Asher ◽

Dean Resnick ◽

Jennifer Brite ◽

Robert Brackbill ◽

James Cone

Keyword(s):

World War II ◽

Record Linkage ◽

Historical Context ◽

World War ◽

Privacy Concerns ◽

Data Standardization ◽

Probabilistic Record Linkage ◽

Academic Fields ◽

Different Types

Since its post-World War II inception, the science of record linkage has grown exponentially and is used across industrial, governmental, and academic agencies. The academic fields that rely on record linkage are diverse, ranging from history to public health to demography. In this paper, we introduce the different types of data linkage and give a historical context to their development. We then introduce the three types of underlying models for probabilistic record linkage: Fellegi-Sunter-based methods, machine learning methods, and Bayesian methods. Practical considerations, such as data standardization and privacy concerns, are then discussed. Finally, recommendations are given for organizations developing or maintaining record linkage programs, with an emphasis on organizations measuring long-term complications of disasters, such as 9/11.


An Efficient Validation Method of Probabilistic Record Linkage Including Readmissions and Twins

Methods of Information in Medicine ◽

10.3414/me0489 ◽

2008 ◽

Vol 47 (04) ◽

pp. 356-363 ◽

Cited By ~ 53

Author(s):

A. C. J. Ravelli ◽

N. Méray ◽

J. B. Reitsma ◽

G. J. Bonsel ◽

M. Tromp

Keyword(s):

Record Linkage ◽

External Validation ◽

Sample Selection ◽

External Information ◽

Multiple Birth ◽

Double Blind ◽

Probabilistic Record Linkage ◽

Registry File ◽

Validation Procedure

Summary Objective: To describe an efficient, generalizable approach to validate probabilistic record linkage results, in particular by a model-guided detection of linking errors, and to apply this approach to validate linkage of admissions of newborns. Methods: Our double-blind validation procedure consisted of three steps: sample selection, data collection and data analysis. The linked Dutch national newborn admission registry contained 30,082 records for 2001 including readmissions (7.4%) and twins (9.7%). A highly informative sample was selected from the linked file by oversampling uncertain links based on model-derived linking weight. Four hundred and eight fax forms with minimal registry information (admissions of 191 children) were sent out to different pediatric units. The pediatricians were asked to create a short detailed patient history from independent sources. The linkage status and additional record data were validated against this external information. Results: Response rate was 97% (395/408 faxes). Accuracy of the linkage of singleton admissions was high: except for some expected errors in the uncertain area (0.02% of record pairs), linkage was error-free. Validation of multiple birth readmissions showed 37% linkage errors due to low data quality of the multiple birth variables. The quality of the linked registry file was still high; only 1.7% of the children were from a multiple birth with multiple admissions, resulting in less than 1% linking error. Conclusions: Our external validation procedure of record linkage was feasible, efficient, and informative about identifying the source of the errors.
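The idea of oversampling uncertain links can be sketched as a two-stratum sample in which links with weights in an "uncertain" band are drawn at a higher rate. This is a rough illustration only, not the authors' model-guided procedure: the function name, the `weight` key, the band limits and the sampling fractions are all hypothetical.

```python
import random


def select_validation_sample(links, lower, upper,
                             frac_uncertain, frac_certain, seed=0):
    """Draw a validation sample that oversamples uncertain links.

    links: list of dicts, each with a numeric "weight" (assumed layout).
    Links with lower <= weight <= upper are treated as uncertain.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    uncertain = [l for l in links if lower <= l["weight"] <= upper]
    certain = [l for l in links if not (lower <= l["weight"] <= upper)]
    n_uncertain = round(len(uncertain) * frac_uncertain)
    n_certain = round(len(certain) * frac_certain)
    # Sample without replacement within each stratum.
    return rng.sample(uncertain, n_uncertain) + rng.sample(certain, n_certain)
```

Because the two strata are sampled at different rates, any error rate estimated from such a sample would need to be reweighted by the inverse sampling fractions before being applied to the whole linked file.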


Record Linkage in the Cancer Registry of Tyrol, Austria

Methods of Information in Medicine ◽

10.1055/s-0038-1634018 ◽

2005 ◽

Vol 44 (05) ◽

pp. 626-630 ◽

Cited By ~ 19

Author(s):

W. Stühlinger ◽

W. Oberaigner

Keyword(s):

Cancer Registry ◽

Record Linkage ◽

Cancer Registries ◽

Patient Data ◽

Medical System ◽

Data Sources ◽

Patient Registration ◽

Probabilistic Record Linkage ◽

Linkage Method ◽

Sufficient Precision

Summary Objective: Record linkage of patient data originating from various data sources and record linkage for checking uniqueness of patient registration are common tasks for every cancer registry. In Austria, there is no unique person identifier in use in the medical system. Hence, it was necessary and the goal of this work to develop an efficient means of record linkage for use in cancer registries in Austria. Methods: We adapted the method of probabilistic record linkage to the situation of cancer registries in Austria. In addition to the customary components of this method, we also took into consideration typing errors commonly occurring in names and dates of birth. The method was implemented in a program written in Delphi™ with interfaces optimised for cancer registries. Results: Applying our record linkage method to 130,509 linkages results in 105,272 (80.7%) identical pairs. For these identical pairs, 88.9% of decisions were performed automatically and 11.1% semi-automatically. For results decided automatically, 6.9% did not have simultaneous identity of last name, first name and date of birth. For results decided semi-automatically, 48.4% did not have an identical last name, 25.6% did not have an identical date of birth and 83.1% did not have simultaneous identity of last name and date of birth and first name. Conclusions: The method implemented in our cancer registry solves all record linkage problems in Austria with sufficient precision.
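The abstract does not say how typing errors in names were handled. One common way to tolerate them is an edit-distance comparison, sketched below under that assumption; the function names and the one-edit threshold are hypothetical and are not taken from the registry's program.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def names_agree(a, b, max_edits=1):
    """Treat two names as agreeing if they differ by at most max_edits edits."""
    return edit_distance(a.strip().lower(), b.strip().lower()) <= max_edits
```

With this threshold a single substituted letter still counts as agreement, while a swap of two adjacent letters counts as two edits in plain Levenshtein distance and would need a larger threshold or a transposition-aware distance.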
