Linkage among healthcare databases, claims, and external data collections such as registries, cohorts or clinical trials is becoming increasingly important for conducting epidemiological research. For example, the French National Health Insurance Information System (SNDS) contains claims-based data on dispensed drugs, out-of-hospital laboratory tests and outpatient medical consultations, but it holds neither test results, nor diagnoses, nor even the symptoms motivating the patient's consultation. Registries, cohorts and clinical trials, on the other hand, collect medically validated data, but collecting such data is time-consuming and costly. Linking the two kinds of sources could therefore compensate for missing information and limit confounding biases. However, performing record linkage on huge health databases and assessing linkage quality can be challenging. This paper introduces a deterministic record linkage strategy that focuses on assessing linkage quality using new quality metrics and that systematically considers all combinations of individual identifiers. This exhaustive exploration of all variable combinations makes it possible to compute a new metric, referred to as robustness, and to generate a linkage cartography that precisely summarizes the characteristics of the linked pairs. This cartography is central to our approach: it allows the expert to easily accept or reject groups of linked pairs. The approach was tested on synthetic datasets staging a variety of possible linkage scenarios (dataset size and ratio, overlap, and errors), and on two real-world studies (a registry database and a clinical trial). The simulations demonstrated very good accuracy, with limited impact of the factors tested, as well as scalability and encouraging runtimes. Whatever the scenario, recall remained above 0.95 and precision above 0.99.
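The exhaustive exploration of identifier combinations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the `min_size` threshold, and the toy records are hypothetical, and because the abstract does not define robustness, the score below (the share of combinations that link a pair unambiguously) is only an assumed stand-in for that metric.

```python
from itertools import combinations

# Hypothetical identifier fields; the source does not list the actual variables.
FIELDS = ("sex", "birth_year", "birth_month", "postcode")


def all_combinations(fields, min_size):
    """Every combination of identifiers holding at least min_size fields."""
    return [c for size in range(min_size, len(fields) + 1)
            for c in combinations(fields, size)]


def link_all_combinations(source, target, fields=FIELDS, min_size=2):
    """Apply one deterministic rule per combination of identifiers.

    A pair is linked by a rule when exactly one target record agrees with
    the source record on every field of the combination. Returns a dict
    mapping (source_id, target_id) to the list of rules that link the pair.
    """
    support = {}
    for combo in all_combinations(fields, min_size):
        # Index target records by their values on this combination.
        index = {}
        for t_id, rec in target.items():
            key = tuple(rec.get(f) for f in combo)
            if None not in key:  # a rule cannot use a missing value
                index.setdefault(key, []).append(t_id)
        for s_id, rec in source.items():
            key = tuple(rec.get(f) for f in combo)
            if None in key:
                continue
            candidates = index.get(key, [])
            if len(candidates) == 1:  # keep unambiguous links only
                support.setdefault((s_id, candidates[0]), []).append(combo)
    return support


def robustness(support, fields=FIELDS, min_size=2):
    """Assumed stand-in metric: share of rules that link each pair."""
    total = len(all_combinations(fields, min_size))
    return {pair: len(rules) / total for pair, rules in support.items()}


# Toy records (entirely made up for illustration).
source = {
    "s1": {"sex": "F", "birth_year": 1950, "birth_month": 3, "postcode": "75001"},
    "s2": {"sex": "M", "birth_year": 1962, "birth_month": None, "postcode": "33000"},
}
target = {
    "t1": {"sex": "F", "birth_year": 1950, "birth_month": 3, "postcode": "75001"},
    "t2": {"sex": "M", "birth_year": 1962, "birth_month": 7, "postcode": "33000"},
    "t3": {"sex": "F", "birth_year": 1950, "birth_month": 9, "postcode": "69002"},
}

support = link_all_combinations(source, target)
scores = robustness(support)
```

In this toy example the pair (`s1`, `t1`) is linked by 10 of the 11 combinations: only the rule using sex and birth year alone is ambiguous, since `t3` shares both values. The pair (`s2`, `t2`) is still linked by the 4 combinations that do not involve the missing birth month, which illustrates in a simplified way why exploring all combinations tolerates missing information. Grouping linked pairs by the set of rules that support them would be one plausible starting point for a summary in the spirit of the linkage cartography, although the abstract does not specify how the cartography is built.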
Feasibility on real datasets was verified with good results: among 3985 patients from the registry, the algorithm found 3850 single linked pairs and 135 proposals with multiple candidates, out of 504,795 candidates. After reviewing the linkage cartography, the expert validated 3783 linked pairs, and a manual review of the multiple candidates added 20 pairs, giving a linkage rate of 95.4%. For the trial, only 2 records out of 129 could not be linked among 22,426 candidates, because of early withdrawal (no information in the trial database), giving a linkage rate of 98.4%. In both cases, the unlinked records did not appear to introduce any bias. Performance is good: linking a synthetic database of 30,000 patients to a synthetic database of 3,000,000 patients takes only a few seconds. The approach is resilient by design to missing information and is therefore well suited to linking the SNDS database to cohorts or registries when the information overlap between the two databases cannot be perfect. In addition, our implementation is fast enough to improve the linkage results interactively through successive runs. The novelty of our approach is twofold: first, the linkage cartography provides a new way of classifying and comparing deterministic rules from the set of all possible rules; second, the approach is by design resilient to data corruption and can reach better recall than standard deterministic linkage strategies. Finally, good performance and scalability open the door to the linkage of very large datasets. The authors did not specify any potential conflicts of interest.
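The linkage rates reported for the two real-world studies follow directly from the counts given in the abstract. The short check below reproduces them; the helper name `linkage_rate` is hypothetical and is used only for this illustration.

```python
def linkage_rate(linked: int, total: int) -> float:
    """Percentage of records in the external dataset that were linked."""
    return 100.0 * linked / total


# Registry: 3783 pairs validated by the expert plus 20 pairs added by
# manual review, out of 3985 patients.
registry_rate = linkage_rate(3783 + 20, 3985)

# Trial: 2 of the 129 records could not be linked.
trial_rate = linkage_rate(129 - 2, 129)

print(f"registry: {registry_rate:.1f}%")  # registry: 95.4%
print(f"trial: {trial_rate:.1f}%")        # trial: 98.4%
```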