Abstract

BackgroundTechniques have been developed to compute statistics on distributed datasets without revealing private information except the statistical results. However, duplicate records in a distributed dataset may lead to incorrect statistical results. Therefore, to increase the accuracy of the statistical analysis of a distributed dataset, secure deduplication is an important preprocessing step.MethodsWe designed a secure protocol for the deduplication of horizontally partitioned datasets with deterministic record linkage algorithms. We provided a formal security analysis of the protocol in the presence of semi-honest adversaries. The protocol was implemented and deployed across three microbiology laboratories located in Norway, and we ran experiments on the datasets in which the number of records for each laboratory varied. Experiments were also performed on simulated microbiology datasets and data custodians connected through a local area network.ResultsThe security analysis demonstrated that the protocol protects the privacy of individuals and data custodians under a semi-honest adversarial model. More precisely, the protocol remains secure with the collusion of up to N − 2 corrupt data custodians. The total runtime for the protocol scales linearly with the addition of data custodians and records. One million simulated records distributed across 20 data custodians were deduplicated within 45 s. The experimental results showed that the protocol is more efficient and scalable than previous protocols for the same problem.ConclusionsThe proposed deduplication protocol is efficient and scalable for practical uses while protecting the privacy of patients and data custodians.

Highlights

  • The focus of this paper is the reuse of health data horizontally partitioned between data custodians, such that each data custodian provides the same attributes for a set of patients

  • Security analysis We prove the security of the proposed protocol in the presence of corrupt data custodians or a corrupt coordinator who tries to learn information as a result of the protocol execution

  • For an adversary that controls a set of data custodians, the adversary’s view is a combination of the corrupt data custodians’ views

Read more

Summary

Introduction

The focus of this paper is the reuse of health data horizontally partitioned between data custodians, such that each data custodian provides the same attributes for a set of patients. Reusing data from multiple data custodians provides a sufficient number of patients who satisfy the inclusion criteria of a particular study. When data are collected across multiple data custodians, the data of a heterogeneous mix of patients can be reused. There has been substantial interest in the reuse of EHR data for public health surveillance, which requires data from multiple data custodians covering the geographic area of interest [5,6,7]. The increased adoption of EHR systems has led, and continues to lead, to the collection of large amounts of health data [1]. Survey, and registry data are being collected. These data could aid in the development of

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.