A new hybrid record linkage process to make epidemiological databases interoperable: application to the GEMO and GENEPSO studies involving BRCA1 and BRCA2 mutation carriers

Yue Jiao, Nadine Andrieu, Dominique Stoppa‐Lyonnet, Fabienne El-Khoury, Anthony Laugé, Séverine Eon-Marchais, Lilian Laborde, Marie-Gabrielle Dondon, M Laurent, Catherine Noguès, Juana Beauvallet, Chloé-Agathe Azencott, Noura Mebirouk, Sandrine M Caputo

doi:10.1186/s12874-021-01299-6

Abstract

Background: Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. We describe a hybrid record linkage process to link databases from two independent ongoing French national studies: GEMO (Genetic Modifiers of BRCA1 and BRCA2), which focuses on identifying genetic factors that modify cancer risk in BRCA1 and BRCA2 mutation carriers, and GENEPSO (prospective cohort of BRCAx mutation carriers), which focuses on environmental and lifestyle risk factors.

Methods: To identify as many as possible of the individuals who participate in both studies but are not registered under a shared identifier, we combined probabilistic record linkage (PRL) and supervised machine learning (ML). This approach (named “PRL + ML”) pools the candidate matches identified by both methods. We built the ML model using the gold standard on a first version of the two databases as a training dataset. This gold standard was obtained from PRL-derived matches verified by an exhaustive manual review.

Results: The Random Forest (RF) algorithm showed the highest recall (0.985) among six widely used ML algorithms: RF, Bagged trees, AdaBoost, Support Vector Machine, Neural Network. RF was therefore selected to build the ML model, since our goal was to identify the maximum number of true matches. Our combined linkage PRL + ML showed a higher recall (range 0.988–0.992) than either PRL (range 0.916–0.991) or ML (0.981) alone. It identified 1995 individuals participating in both GEMO (6375 participants) and GENEPSO (4925 participants).

Conclusions: Our hybrid linkage process is an efficient tool for linking GEMO and GENEPSO. It may be generalizable to other epidemiological studies involving other databases and registries.

Highlights

Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach
We train a machine learning (ML) model on a training set whose ground truth was established by probabilistic record linkage (PRL) followed by manual review
Ten matching variables shared between Genetic Modifiers of BRCA1 and BRCA2 (GEMO) and GENEPSO were used for comparison (Table 1): recruiting center number (CTR), family number (NUMFAM), individual number in the family (SUJID), gender (GENDER), year of birth (Yob), month of birth (Mob), day of birth (Dob), BRCA1 mutational status (BRCA1), BRCA2 mutational status (BRCA2) and mutation description using the Human Genome Variation Society (HGVS) nomenclature (MUT_HGVS)
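
As an illustration of how the ten matching variables could feed a record pair comparison, the sketch below builds an agreement vector for one GEMO record and one GENEPSO record. It is a minimal sketch, not the authors' implementation: the exact-match rule after trimming and lower-casing, the handling of missing values, the function name `compare_records` and all record values are assumptions made for this example. Only the variable names come from the text.

```python
from typing import Dict, List, Optional

# The ten shared matching variables named in the text.
MATCHING_VARIABLES: List[str] = [
    "CTR", "NUMFAM", "SUJID", "GENDER", "Yob", "Mob", "Dob",
    "BRCA1", "BRCA2", "MUT_HGVS",
]

Record = Dict[str, Optional[str]]


def compare_records(a: Record, b: Record) -> Dict[str, Optional[int]]:
    """Return 1 (agree), 0 (disagree) or None (value missing) per variable.

    Assumption: agreement means exact equality after trimming and
    lower-casing; the study may use other comparison functions.
    """
    vector: Dict[str, Optional[int]] = {}
    for var in MATCHING_VARIABLES:
        va, vb = a.get(var), b.get(var)
        if va is None or vb is None:
            vector[var] = None  # cannot compare a missing value
        else:
            vector[var] = int(va.strip().lower() == vb.strip().lower())
    return vector


# Hypothetical records; every value here is invented for illustration.
gemo_record: Record = {
    "CTR": "01", "NUMFAM": "123", "SUJID": "1", "GENDER": "F",
    "Yob": "1970", "Mob": "05", "Dob": "12",
    "BRCA1": "1", "BRCA2": "0", "MUT_HGVS": "c.68_69del",
}
genepso_record: Record = {
    "CTR": "01", "NUMFAM": "123", "SUJID": "2", "GENDER": "f",
    "Yob": "1970", "Mob": "05", "Dob": "12",
    "BRCA1": "1", "BRCA2": "0", "MUT_HGVS": None,
}

comparison = compare_records(gemo_record, genepso_record)
n_agreements = sum(v for v in comparison.values() if v is not None)
print(comparison)
print("agreeing variables:", n_agreements)
```

A vector of this kind is the usual input to both a probabilistic linkage score and a supervised classifier; keeping missing values distinct from disagreements lets either method treat them separately.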

Summary

Introduction

Linking independent sources of data describing the same individuals enables innovative epidemiological and health studies but requires a robust record linkage approach. Record linkage is a process that identifies records that appear in different databases and refer to the same entity (e.g. an individual) [1], but that do not share a common unique identifier. The status of a pair of records is either matching (same individual) or non-matching (distinct individuals). The process consists of three successive steps: data preprocessing (curation of the data), record pair comparison, and linkage. When no unique person identifier is shared between the two datasets, linkage has to be performed by comparing shared matching variables. Record linkage is subject to two types of errors: False Positives (FP), i.e. true non-matches classified as matches, and False Negatives (FN), i.e. true matches classified as non-matches.
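
The two error types can be made concrete by counting them against a set of known true matches. The sketch below is illustrative only: the function name `linkage_errors` and the record identifiers are hypothetical, and recall and precision follow their standard definitions, TP / (TP + FN) and TP / (TP + FP), where TP is the number of true matches correctly classified as matches.

```python
from typing import Dict, Set, Tuple

# A candidate link is a pair (id in database A, id in database B).
Pair = Tuple[str, str]


def linkage_errors(predicted: Set[Pair], truth: Set[Pair]) -> Dict[str, float]:
    """Count TP, FP and FN for predicted links and derive recall/precision."""
    tp = len(predicted & truth)   # true matches classified as matches
    fp = len(predicted - truth)   # true non-matches classified as matches
    fn = len(truth - predicted)   # true matches classified as non-matches
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "recall": recall, "precision": precision}


# Hypothetical identifiers, invented for illustration.
true_matches: Set[Pair] = {("g1", "p1"), ("g2", "p2"), ("g3", "p3"), ("g4", "p4")}
predicted_matches: Set[Pair] = {("g1", "p1"), ("g2", "p2"), ("g3", "p9")}

metrics = linkage_errors(predicted_matches, true_matches)
print(metrics)
```

A linkage that aims to find as many true matches as possible is judged mainly on recall, which falls with every false negative; precision captures the cost paid in false positives.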

Objectives

Methods

Results

Discussion

Conclusion