Evaluating privacy-preserving record linkage using cryptographic long-term keys and multibit trees on large medical datasets

Adrian P Brown, Sean M Randall, Rainer Schnell, Christian Borgs

doi:10.1186/s12911-017-0478-5

Open Access

Abstract

Background: Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. Real-world performance of these techniques using large-scale data is unknown up to now.

Methods: Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated.

Results: Clear-text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time, but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine-tuning of parameters.

Conclusions: We argue that increased privacy of PPRL comes at the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
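The linkage quality measures named in the abstract (recall, precision and F-measure) can be illustrated with a minimal sketch. It assumes that a linkage result and the gold standard are each available as a collection of record-ID pairs; the function name `linkage_quality` and the pair representation are hypothetical and are not taken from the paper.

```python
def linkage_quality(predicted_pairs, gold_pairs):
    """Return (precision, recall, F-measure) for a set of predicted links.

    Pairs are treated as unordered, so (1, 2) and (2, 1) count as the same link.
    This is an illustrative sketch, not the evaluation code used in the study.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    gold = {frozenset(pair) for pair in gold_pairs}

    true_positives = len(predicted & gold)

    # Precision: share of predicted links that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true links that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)

    return precision, recall, f_measure


if __name__ == "__main__":
    predicted = [(1, 2), (3, 4), (5, 6)]
    gold = [(2, 1), (3, 4), (7, 8), (9, 10)]
    print(linkage_quality(predicted, gold))
```

In this toy example two of the three predicted links are correct and two of the four true links are found, so precision exceeds recall.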

Highlights

Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research.
For many research endeavors, linking the information needed would be trivial if a unique personal identifier (PID) were available.
Since this requires the release of personally identifying information to trusted third parties [11], privacy regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rules [12] or current European Union (EU) regulations

Summary

Introduction

Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. For many research endeavors, linking the information needed would be trivial if a unique personal identifier (PID) were available. In many settings, legal and administrative issues prevent the use of PIDs, restricting data linkage to personal identifiers such as names. Since this requires the release of personally identifying information to trusted third parties [11], privacy regulations, such as the HIPAA Privacy Rules [12] or current EU regulations. Standard probabilistic record linkage methods [3] are sometimes unsuitable for use with encrypted identifiers.
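To give a rough idea of what an encrypted identifier can look like in the Bloom-filter approach named in the abstract, the sketch below hashes the character bigrams of several fields into one shared bit array, which is the general idea behind a composite Bloom filter (CLK). Everything specific here is an assumption made for illustration: the filter length `m`, the number of hash functions `k`, the use of HMAC-SHA256, the blank padding of values, the per-field key derivation, and the Dice coefficient as the similarity measure. None of these is confirmed by the text above as the configuration used in the study.

```python
import hashlib
import hmac


def bigrams(value):
    """Split a value into character bigrams, padded with blanks (assumed convention)."""
    padded = f" {value.strip().lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk(fields, secret_key, m=1000, k=10):
    """Encode several identifier fields into one set of bit positions.

    fields: dict mapping a field name to its string value.
    secret_key: bytes shared by the data owners.
    m, k: illustrative filter length and hash count, not the paper's settings.
    """
    positions = set()
    for name, value in fields.items():
        # Hypothetical per-field key so that equal bigrams in different
        # fields map to different bit positions.
        field_key = secret_key + name.encode("utf-8")
        for gram in bigrams(value):
            for i in range(k):
                message = f"{i}:{gram}".encode("utf-8")
                digest = hmac.new(field_key, message, hashlib.sha256).digest()
                positions.add(int.from_bytes(digest, "big") % m)
    return positions


def dice(a, b):
    """Dice coefficient of two bit-position sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    key = b"example-secret"
    rec_a = clk({"first": "adrian", "last": "brown"}, key)
    rec_b = clk({"first": "adrien", "last": "brown"}, key)  # spelling variant
    rec_c = clk({"first": "rainer", "last": "schnell"}, key)
    print(dice(rec_a, rec_b), dice(rec_a, rec_c))
```

Because similar strings share most of their bigrams, a spelling variant yields a bit pattern that overlaps heavily with the original, whereas unrelated names overlap only through chance collisions. This is what allows approximate matching without comparing the clear-text values.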

Methods

Results

Conclusion