Abstract
Background: Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. Under many jurisdictions, unique personal identifiers needed for linking the records are unavailable. Since sensitive attributes, such as names, have to be used instead, privacy regulations usually demand encrypting these identifiers. The corresponding set of techniques for privacy-preserving record linkage (PPRL) has received widespread attention. One recent method is based on Bloom filters. Due to superior resilience against cryptographic attacks, composite Bloom filters (cryptographic long-term keys, CLKs) are considered best practice for privacy in PPRL. Real-world performance of these techniques using large-scale data is unknown up to now.

Methods: Using a large subset of Australian hospital admission data, we tested the performance of an innovative PPRL technique (CLKs using multibit trees) against a gold-standard derived from clear-text probabilistic record linkage. Linkage time and linkage quality (recall, precision and F-measure) were evaluated.

Results: Clear text probabilistic linkage resulted in marginally higher precision and recall than CLKs. PPRL required more computing time but 5 million records could still be de-duplicated within one day. However, the PPRL approach required fine tuning of parameters.

Conclusions: We argue that increased privacy of PPRL comes with the price of small losses in precision and recall and a large increase in computational burden and setup time. These costs seem to be acceptable in most applied settings, but they have to be considered in the decision to apply PPRL. Further research on the optimal automatic choice of parameters is needed.
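To make the idea of a composite Bloom filter more concrete, the following is a minimal sketch of how a CLK might be built and compared. It is not the implementation evaluated in the study: the filter length, the number of hash functions, the bigram padding, the choice of HMAC-SHA1 and HMAC-SHA256 for double hashing, the per-field prefix, and the names `bigrams`, `clk` and `dice` are all assumptions made for illustration. The multibit tree search used in the study is not shown.

```python
import hashlib
import hmac

FILTER_BITS = 1000  # assumed filter length, not taken from the study
NUM_HASHES = 10     # assumed number of bit positions set per bigram


def bigrams(value: str) -> set[str]:
    """Split a lower-cased, space-padded identifier into 2-character substrings."""
    padded = f" {value.strip().lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def clk(fields: list[str], secret: bytes) -> set[int]:
    """Hash the bigrams of all identifier fields into ONE shared Bloom filter.

    The filter is represented as the set of bit positions that are 1.
    """
    bits: set[int] = set()
    for index, value in enumerate(fields):
        for gram in bigrams(value):
            # Prefixing the field index is an illustrative way to keep
            # identical bigrams from different fields apart.
            token = f"{index}:{gram}".encode()
            h1 = int.from_bytes(hmac.new(secret, token, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(secret, token, hashlib.sha256).digest(), "big")
            # Double hashing: derive NUM_HASHES positions from two keyed hashes.
            for i in range(NUM_HASHES):
                bits.add((h1 + i * h2) % FILTER_BITS)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice coefficient of two filters: 2 * |A and B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed on by the data owners
record_a = clk(["smith", "john", "1970"], key)
record_b = clk(["smyth", "john", "1970"], key)   # spelling variant of record_a
record_c = clk(["miller", "anna", "1985"], key)  # unrelated person

print(dice(record_a, record_b), dice(record_a, record_c))
```

Because similar strings share most of their bigrams, the spelling variant should score a noticeably higher Dice coefficient against `record_a` than the unrelated record does. This is what allows error-tolerant matching on encoded identifiers. The similarity threshold for accepting a pair as a match is one of the parameters that would need tuning.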
Highlights
Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research
For many research endeavors, linking the information needed would be trivial if a unique personal identifier (PID) were available
Since this requires the release of personally identifying information to trusted third parties [11], privacy regulations, such as the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rules [12] or current European Union (EU) regulations
Summary
Integrating medical data using databases from different sources by record linkage is a powerful technique increasingly used in medical research. For many research endeavors, linking the information needed would be trivial if a unique personal identifier (PID) were available. In many settings, however, legal and administrative issues prevent the use of PIDs, restricting data linkage to personal identifiers such as names. Since such sensitive attributes have to be used instead, privacy regulations usually demand encrypting these identifiers. Since this requires the release of personally identifying information to trusted third parties [11], privacy regulations, such as the HIPAA Privacy Rules [12] or current EU regulations. Standard probabilistic record linkage methods [3] are sometimes unsuitable for use with encrypted identifiers.