Probabilistic record linkage and an automated procedure to minimize the undecided-matched pair problem

2004 ◽  
Vol 20 (4) ◽  
pp. 915-925 ◽  
Author(s):  
Carla Jorge Machado ◽  
Kenneth Hill

Probabilistic record linkage allows information from different data sources to be assembled. We present a procedure for cases in which a one-to-one relationship between records in different files is expected but not found. The data were births and infant deaths for the 1998 birth cohort of the city of São Paulo, Brazil. Pairs for which a one-to-one relationship was obtained, and for which a best link with the highest weight was found, were taken as unequivocally matched pairs and provided information for deciding on the remaining pairs. Among these unequivocal pairs, an expected relationship was found between the differences in dates of death and of birth registration, and the places of birth and death registration for neonatal deaths were likely to be the same. This evidence was used to resolve the remaining pairs. We reduced the number of non-uniquely matched records and of uncertain matches, and increased the number of uniquely matched pairs from 2,249 to 2,827. Future research using record linkage should apply strategies derived from the first linkage runs before a full clerical review (the standard procedure under uncertainty) to retrieve matches efficiently.
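The selection of unequivocally matched pairs described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: it assumes a "best link" means a pair whose weight is the single highest for both of its records, and the tuple layout, record identifiers, and weights are hypothetical.

```python
from collections import defaultdict


def mutual_best_links(candidates):
    """Return pairs that are the single highest-weight link for both records.

    candidates: iterable of (birth_id, death_id, weight) tuples.
    This layout is an assumption made for illustration only.
    """
    by_birth = defaultdict(list)
    by_death = defaultdict(list)
    for birth_id, death_id, weight in candidates:
        by_birth[birth_id].append((weight, death_id))
        by_death[death_id].append((weight, birth_id))

    def unique_best(links):
        # Highest-weight partner, or None when the top weight is tied.
        top = max(w for w, _ in links)
        partners = [p for w, p in links if w == top]
        return partners[0] if len(partners) == 1 else None

    best_for_birth = {b: unique_best(links) for b, links in by_birth.items()}
    best_for_death = {d: unique_best(links) for d, links in by_death.items()}

    # Keep a pair only when each record is the other's unique best link.
    return sorted(
        (b, d)
        for b, d in best_for_birth.items()
        if d is not None and best_for_death.get(d) == b
    )


# Hypothetical candidate pairs: (birth record, death record, linkage weight).
candidates = [
    ("b1", "d1", 12.5),
    ("b1", "d2", 7.0),
    ("b2", "d2", 9.1),
    ("b3", "d3", 8.0),
    ("b4", "d3", 8.0),
]
matched = mutual_best_links(candidates)
print(matched)
```

In this toy input, `d3` is tied between `b3` and `b4`, so neither pair is accepted; pairs like these are the ones that would be left for the second-stage evidence (date differences, place of registration) or for clerical review.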

Author(s):  
Marcos Barreto ◽  
André Alves ◽  
Samila Sena ◽  
Rosemeire Fiaccone ◽  
Leila Amorim ◽  
...  

ABSTRACT Background and aims: The Brazilian government has several social protection programmes that select their beneficiaries based on socioeconomic information kept in the Cadastro Único (CADU) database. The CADU will be used to build a population-based cohort of approximately 100 million individuals. Among the social programmes is the Bolsa Família (PBF), a conditional cash transfer programme that provides extra income to poor families. These two databases must be deterministically linked to individuals who have received payments from PBF between 2004 and 2012. The resulting cohort will be used in epidemiological studies aiming to assess the impact of PBF on the occurrence and severity of several diseases and health problems (tuberculosis, leprosy, HIV, child health, etc.). This cohort must be probabilistically linked with databases from the Unified Health System (SUS), such as hospitalization, notifiable diseases, mortality, and live births, in order to produce data marts (domain-specific data) for the proposed studies. Our goals comprise the validation of probabilistic record linkage methods to support this cohort setup. Approach: This paper emphasizes the accuracy assessment of our methods based on the linkage of SIH (hospitalization), SINAN (notifications), and SIM (mortality) records to the 2011 extraction of CADU. We focused on hospitalization and notification of tuberculosis, as well as infant mortality for all causes in under-4 children, for a small sample with 30,029 records (CADU). Due to the absence of gold standards, we used two approaches to assess accuracy: a clerical review and an automatic (tool-based) search. In the first case, we used different cut-off points as similarity index to calculate sensitivity and specificity, and a ROC curve to separate matched and non-matched pairs. The second approach retrieves from CADU all matched and non-matched pairs for a given individual, serving as a gold standard for validation.
Results: We retrieved 22 linked pairs, of which 18 are true positives for infant mortality (SIM database). From SINAN, we obtained 434 linked pairs with 166 true positives, and from SIH, 121 linked pairs with 34 true positives. The sensitivity of the manual scan for SIM (child mortality) ranges from 44% (specificity of 100%) to 95% (specificity of 94%), with similarity indices between 0.80 and 0.97, respectively. For the automatic search, we obtained a sensitivity of 69.2% and a specificity of 91.8%. Conclusion: Our results show the need for continuous improvement in our linkage routines and in how to consistently evaluate their accuracy in the absence of adequate gold standards.
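The linked-pair counts reported in the results can be turned into a share of true positives per database. The short Python sketch below does that arithmetic; reading "true positives among linked pairs" as a positive predictive value is an interpretation made here, not a metric the abstract itself reports, and the dictionary layout is hypothetical.

```python
# (linked pairs, true positives) as reported in the abstract.
reported = {
    "SIM": (22, 18),
    "SINAN": (434, 166),
    "SIH": (121, 34),
}


def positive_predictive_value(linked, true_positives):
    """Share of linked pairs that are true matches."""
    return true_positives / linked


ppv = {db: positive_predictive_value(*counts) for db, counts in reported.items()}
for db, value in ppv.items():
    print(f"{db}: {value:.1%}")
```

Under this reading, about 81.8% of the SIM links are true matches, against roughly 38.2% for SINAN and 28.1% for SIH.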


2021 ◽  
pp. 1-11
Author(s):  
Charles Salame ◽  
Inti Gonzalez ◽  
Rodrigo Gomez-Fell ◽  
Ricardo Jaña ◽  
Jorge Arigony-Neto

Abstract: This paper provides the first evidence for sea-ice formation in the Cordillera Darwin (CD) fjords in southern Chile, which is farther north than sea ice has previously been reported for the Southern Hemisphere. Initially observed from a passenger plane in September 2015, the presence of sea ice was then confirmed by aerial reconnaissance and subsequently identified in satellite imagery. A time series of Sentinel-1 and Landsat-8 images during austral winter 2015 was used to examine the chronology of sea-ice formation in the Cuevas fjord. A longer time series of imagery across the CD was analyzed from 2000 to 2017 and revealed that sea ice had formed in each of the 13 fjords during at least one winter and was present in some fjords during a majority of the years. Sea ice is more common in the northern end of the CD than in the south, where sea ice is not typically present. It is suggested that surface freshening from melting glaciers and high precipitation reduces surface salinity and promotes sea-ice formation within the semi-enclosed fjord system during prolonged periods of cold air temperatures. This is a unique set of initial observations that identify questions for future research in this remote area.


2000 ◽  
Vol 16 (2) ◽  
pp. 439-447 ◽  
Author(s):  
Kenneth R. de Camargo Jr. ◽  
Cláudia M. Coeli

We present a database linkage system based on the probabilistic record linkage technique, developed in C++ with the Borland C++ Builder version 3.0 programming environment. The system was tested on data sources of different sizes and evaluated in terms of processing time and of sensitivity in identifying true pairs. Processing the records took less time with the program than manually, especially for the larger databases. The sensitivities of the manual and automatic processes were equivalent for databases with fewer records; however, as the databases grew, a tendency toward lower sensitivity was observed only in the manual process. Although still at an early stage of development, the system performed well in both speed and sensitivity. Although the performance of the algorithms used was satisfactory, the aim is to evaluate other routines in order to improve the system's performance.


2014 ◽  
Vol 30 (2) ◽  
pp. 433-438 ◽  
Author(s):  
Silvano Barbosa de Oliveira ◽  
Edgar Merchan-Hamann ◽  
Leila Denise Alves Ferreira Amorim

The aim of this study is to estimate the prevalence of HIV/HBV and HIV/HCV coinfections among AIDS cases reported in Brazil, and to describe the epidemiological profile of these cases. Coinfection was identified through probabilistic record linkage of the data of all patients carrying the HIV virus recorded as AIDS patients and of those patients reported as carriers of hepatitis B or C virus in various databases from the Brazilian Ministry of Health from 1999 to 2010. In this period 370,672 AIDS cases were reported, of which 3,724 were HIV/HBV coinfections. Women are less likely to become coinfected than men and the chance of coinfection increases with age. This study allowed an important evaluation of HBV/HIV and HCV/HIV coinfections in Brazil using information obtained via merging secondary databases from the Ministry of Health, without conducting seroprevalence research. The findings of this study might be important for planning activities of the Brazilian epidemiologic surveillance agencies.


Author(s):  
Colin Babyak ◽  
Abdelnasser Saidi

ABSTRACT Objectives: The objectives of this talk are to introduce Statistics Canada’s Social Data Linkage Environment (SDLE) and to explain the methodology behind the creation of the central depository and how both deterministic and probabilistic record linkage techniques are used to maintain and expand the environment. Approach: We will start with a brief overview of the SDLE and then continue with a discussion of how both deterministic linkages and probabilistic linkages (using Statistics Canada’s generalized record linkage software, G-Link) have been combined to create and maintain a very large central depository, which can in turn be linked to virtually any social data source for the ultimate end goal of analysis. Results: Although Canada has a population of about 36 million people, the central depository contains some 300 million records to represent them, due to multiple addresses, names, etc. Although this allows for a significant reduction in missing links, it raises the spectre of additional false positive matches and has added computational complexity which we have had to overcome. Conclusion: The combination of deterministic and probabilistic record linkage strategies has been effective in creating the central depository for the SDLE. As more and more data are linked to the environment and we continue to refine our methodology, we can now move on to the ultimate goal of the SDLE, which is to analyze this vast wealth of linked data.


2015 ◽  
Vol 45 (3) ◽  
pp. 954-964 ◽  
Author(s):  
Adrian Sayers ◽  
Yoav Ben-Shlomo ◽  
Ashley W Blom ◽  
Fiona Steele

2022 ◽  
pp. 107780122110706
Author(s):  
Sarah E. Ullman ◽  
Emily A. Waterman ◽  
Katie M. Edwards ◽  
Jania Marshall ◽  
Christina M. Dardis ◽  
...  

The current article describes a novel recruitment protocol for collecting data from sexual assault and intimate partner violence survivors referred to research studies by individuals to whom they had previously disclosed. Challenges in both recruiting participants and interpreting data are described. Only 35.8% of cases had usable data for both survivors and disclosure recipients, suggesting that this referral method had limited success in recruiting matched pairs. Suggestions for modifications to improve the protocol for future research are offered. Potential advantages and drawbacks of various methods for recruiting dyads are described in order to facilitate future research on survivors’ disclosure processes, social reactions, and the influence of social reactions on survivor recovery.


Author(s):  
Yinghao Zhang ◽  
Senlin Xu ◽  
Mingfan Zheng ◽  
Xinran Li

Record linkage is the task of identifying which records refer to the same entity. When records in different data sources do not have a common key and contain typographical errors in their identifier fields, the extended Fellegi–Sunter probabilistic record linkage method proposed by Winkler, which takes field similarity into account, is to our knowledge one of the most effective methods for performing record linkage. However, this method cannot efficiently handle missing values in the fields: an inappropriate weight is assigned to a record pair containing missing data. Therefore, to improve the performance of Winkler’s probabilistic record linkage method in the presence of missing values, we propose a solution that adjusts a record pair’s weight when missing data occur, which improves the accuracy of Winkler’s record linkage decisions without a large increase in computational time.
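To make the missing-value problem concrete, here is a minimal Python sketch of basic Fellegi–Sunter field weights. It is not the adjustment proposed in this abstract, and it omits Winkler's similarity-based extension: it uses exact agreement only and, as a common simple convention assumed here, lets a field with a missing value contribute zero weight. The field names and the m- and u-probabilities are hypothetical.

```python
import math

# Hypothetical per-field probabilities:
# m = P(fields agree | records match), u = P(fields agree | records do not match).
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "city": (0.90, 0.20),
}


def field_weight(field, a, b):
    """Log2 likelihood-ratio weight for one field comparison."""
    if a is None or b is None:
        # Assumption: a missing value is treated as carrying no evidence.
        return 0.0
    m, u = M_U[field]
    if a == b:
        return math.log2(m / u)  # agreement weight
    return math.log2((1 - m) / (1 - u))  # disagreement weight


def pair_weight(rec_a, rec_b):
    """Total weight of a record pair: the sum of its field weights."""
    return sum(field_weight(f, rec_a.get(f), rec_b.get(f)) for f in M_U)


complete = {"surname": "silva", "birth_year": 1998, "city": "sao paulo"}
missing_city = {"surname": "silva", "birth_year": 1998, "city": None}

w_full = pair_weight(complete, complete)
w_missing = pair_weight(complete, missing_city)
print(round(w_full, 2), round(w_missing, 2))
```

With these illustrative probabilities, scoring the missing `city` as a disagreement instead would subtract 3 bits (log2 of 0.1/0.8) from the pair, which shows how the handling of missing fields can move a pair across a decision threshold.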


Check List ◽  
2015 ◽  
Vol 11 (1) ◽  
pp. 1554
Author(s):  
Rafaela Lima de Farias ◽  
Thuanny Fernanda Braga Alencar ◽  
Elvio S.F. Medeiros

The present study describes a new site of occurrence for the genus Lopescladius in Brazil and reports the first record for the Piranhas-Açu River basin, in the state of Rio Grande do Norte, northeastern Brazil. This new occurrence expands the distribution of the genus and adds to the knowledge of the chironomid fauna. The presence of this genus in an intermittent stream highlights the importance of future research on this type of aquatic system as well as ecological aspects related to Lopescladius.

