A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yue Jiao ◽  
Fabienne Lesueur ◽  
Chloé-Agathe Azencott ◽  
Maïté Laurent ◽  
Noura Mebirouk ◽  
...  

Abstract Background Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies, GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods To identify as many as possible of the individuals participating in both studies but not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) combined the candidate matches identified by both approaches. We built the ML model using the gold standard on a first version of the two databases as a training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review. Results The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. Therefore, RF was selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants). Conclusions Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.
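
The combination step described in the Methods can be illustrated with a minimal sketch. It assumes that "combined" means taking the set union of the candidate pairs returned by each approach, and that recall is the share of gold-standard matches recovered. The function names and record identifiers below are hypothetical and are not taken from the study.

```python
def combine_matches(prl_pairs, ml_pairs):
    """Union of candidate matches from the two approaches (assumed meaning of 'PRL + ML')."""
    return set(prl_pairs) | set(ml_pairs)


def recall(predicted, gold):
    """Fraction of gold-standard true matches present among the predicted pairs."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    return len(set(predicted) & set(gold)) / len(gold)


# Hypothetical (GEMO id, GENEPSO id) pairs, for illustration only.
gold = {("G1", "P1"), ("G2", "P2"), ("G3", "P3"), ("G4", "P4")}
prl = {("G1", "P1"), ("G2", "P2"), ("G3", "P3")}
ml = {("G2", "P2"), ("G3", "P3"), ("G4", "P4"), ("G5", "P9")}

combined = combine_matches(prl, ml)
print(recall(prl, gold), recall(ml, gold), recall(combined, gold))
```

Under this assumption the union can never have lower recall than either input, which is consistent with the direction of the reported results. It can, however, carry along false candidates from either side (such as the hypothetical pair `("G5", "P9")`), so some form of review of the combined set would still be needed.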

2020 ◽  
Author(s):  
Yue Jiao ◽  
Fabienne Lesueur ◽  
Chloé-Agathe Azencott ◽  
Maïté Laurent ◽  
Noura Mebirouk ◽  
...  

Abstract Background Linking independent sources of data related to the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing national studies, GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on the identification of genetic factors modifying cancer risk of BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors. Methods To identify the maximum number of individuals participating in both studies who may not be registered under a common number, we combined Probabilistic Record Linkage (PRL) and supervised Machine Learning (ML). This combined linkage was named “PRL+ML”. We built the ML model using a first version of the two databases as a training dataset, on which matching status was assigned by PRL followed by manual review. Results The Random Forest (RF) algorithm showed the highest sensitivity (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. Therefore, RF was selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined linkage PRL+ML showed a higher sensitivity (range 0.988-0.992) than either PRL (range 0.916-0.991) or ML (0.981) alone. It identified 2,068 individuals participating in both GEMO (6,375 participants) and GENEPSO (4,925 participants). Conclusions Our hybrid linkage process represents an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.


2014 ◽  
Vol 24 (1) ◽  
pp. 308-316 ◽  
Author(s):  
Paolo Peterlongo ◽  
Jenny Chang-Claude ◽  
Kirsten B. Moysich ◽  
Anja Rudolph ◽  
Rita K. Schmutzler ◽  
...  

2021 ◽  
Vol 132 ◽  
pp. S357-S358
Author(s):  
Shana Kim ◽  
Jan Lubinski ◽  
Tomasz Huzarski ◽  
Pal Moller ◽  
Susan Armel ◽  
...  

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Juliette Coignard ◽  
Michael Lush ◽  
Jonathan Beesley ◽  
Tracy A. O’Mara ◽  
...  

Abstract Breast cancer (BC) risk for BRCA1 and BRCA2 mutation carriers varies by genetic and familial factors. About 50 common variants have been shown to modify BC risk for mutation carriers. All but three were identified in general population studies. Other mutation carrier-specific susceptibility variants may exist, but studies of mutation carriers have so far been underpowered. We conduct a novel case-only genome-wide association study comparing genotype frequencies between 60,212 general population BC cases and 13,007 cases with BRCA1 or BRCA2 mutations. We identify robust novel associations for 2 variants with BC for BRCA1 and 3 for BRCA2 mutation carriers, P < 10⁻⁸, at 5 loci, which are not associated with risk in the general population. They include rs60882887 at 11p11.2 where MADD, SP11 and EIF1, genes previously implicated in BC biology, are predicted as potential targets. These findings will contribute towards customising BC polygenic risk scores for BRCA1 and BRCA2 mutation carriers.
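
To make the idea of comparing genotype frequencies between two groups of cases concrete, here is a deliberately simplified sketch. It is an assumption for illustration, not the model used in the study: it computes an allelic odds ratio from a 2x2 table of allele counts, with a Wald z-test on the log odds ratio. The function name and all counts are hypothetical.

```python
import math


def allele_odds_ratio(alt_a, ref_a, alt_b, ref_b):
    """Allelic odds ratio of group A versus group B, with a two-sided Wald p-value.

    alt_* and ref_* are counts of alternate and reference alleles in each group.
    All four counts must be positive.
    """
    odds_ratio = (alt_a * ref_b) / (ref_a * alt_b)
    # Standard error of log(OR) from the four cell counts (Woolf's method).
    se = math.sqrt(1 / alt_a + 1 / ref_a + 1 / alt_b + 1 / ref_b)
    z = math.log(odds_ratio) / se
    # Two-sided p-value under a standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, z, p


# Hypothetical allele counts: mutation-carrier cases versus general population cases.
or_, z, p = allele_odds_ratio(alt_a=300, ref_a=700, alt_b=200, ref_b=800)
print(or_, z, p)
```

A real genome-wide analysis would typically adjust for covariates such as ancestry and apply a stringent significance threshold across all tested variants. This sketch shows only the core frequency comparison for a single variant.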


2015 ◽  
Vol 151 (3) ◽  
pp. 653-660 ◽  
Author(s):  
Tehillah S. Menes ◽  
Mary Beth Terry ◽  
David Goldgar ◽  
Irene L. Andrulis ◽  
Julia A. Knight ◽  
...  
