Abstract

Distributed health data networks (DHDNs) leverage data from multiple sources or sites, such as electronic health records (EHRs) from multiple healthcare systems, and have drawn increasing interest in recent years because they do not require sharing of subject-level data and hence considerably lower the hurdles for collaboration between institutions. However, DHDNs face a number of challenges in data analysis, particularly in the presence of missing data. The current state-of-the-art methods for handling incomplete data require pooling data into a central repository before analysis, which is not feasible in DHDNs. In this paper, we address the missing data problem in distributed environments such as DHDNs, a problem that has not been investigated previously. We develop communication-efficient distributed multiple imputation methods for incomplete data that are horizontally partitioned. Because subject-level data are not shared or transferred outside of each site in the proposed methods, they enhance protection of patient privacy and have the potential to strengthen public trust in the analysis of sensitive health data. We investigate the performance of these methods through extensive simulation studies. We also apply the methods to the analysis of an acute stroke dataset collected from multiple hospitals, mimicking a DHDN in which health data are horizontally partitioned across hospitals and subject-level data cannot be shared or sent to a central data repository.
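To make the idea of analyzing horizontally partitioned data without moving subject-level records concrete, the following is a minimal sketch, not the imputation method developed in the paper. It assumes fully observed data, a single covariate, and an ordinary least-squares fit; the function names `local_summary` and `combine` and the toy numbers are hypothetical. Each site sends only five aggregate sums, and the coordinating node recovers the same fit it would obtain from pooled data.

```python
from typing import Dict, List, Tuple

Site = List[Tuple[float, float]]  # (x, y) rows held locally at one site

def local_summary(rows: Site) -> Dict[str, float]:
    """Runs inside a site: returns aggregate sums only, never individual rows."""
    return {
        "n": float(len(rows)),
        "sx": sum(x for x, _ in rows),
        "sy": sum(y for _, y in rows),
        "sxx": sum(x * x for x, _ in rows),
        "sxy": sum(x * y for x, y in rows),
    }

def combine(summaries: List[Dict[str, float]]) -> Tuple[float, float]:
    """Runs at the coordinating node: adds site summaries, solves least squares."""
    total = {k: sum(s[k] for s in summaries) for k in summaries[0]}
    n, sx, sy, sxx, sxy = (total[k] for k in ("n", "sx", "sy", "sxx", "sxy"))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope

# Toy data (illustrative only): three sites, each holding different subjects.
sites = [
    [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9)],
    [(3.0, 7.2), (4.0, 8.8)],
    [(5.0, 11.1), (6.0, 13.0), (7.0, 14.9)],
]

# Distributed fit: one round of communication, five numbers per site.
intercept, slope = combine([local_summary(s) for s in sites])

# Reference fit on pooled rows, which a real DHDN would not be able to form.
pooled_rows = [row for s in sites for row in s]
pooled_intercept, pooled_slope = combine([local_summary(pooled_rows)])
```

The two fits agree up to floating-point error because the least-squares solution depends on the data only through sums that add across sites. Handling missing values in this setting is the harder problem the paper addresses, and this sketch does not attempt it.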

Highlights

  • Distributed health data networks (DHDNs) leverage data from multiple sources or sites, such as electronic health records (EHRs) from multiple healthcare systems, and have drawn increasing interest in recent years because they do not require sharing of subject-level data and considerably lower the hurdles for collaboration between institutions

  • We conduct simulation studies to investigate the strengths and limitations of the four privacy-preserving distributed multiple imputation (MI) methods described in the “Methods” section, under the missing at random (MAR) assumption

  • To benchmark the performance of the distributed MI methods, we compare their results with those from the complete data (CD) analysis, which fits the analysis model using the full data before missing values are generated, and with those from the complete case analysis, which fits the analysis model using only the complete cases, that is, the subjects with all variables observed after missing values are generated
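The two benchmarks in the last highlight can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the simulation design used in the paper: the estimand is simply the mean of an outcome `y`, the data-generating model and the 0.8 missingness probability are made up for illustration, and the helper `make_missing` is hypothetical. Missingness in `y` depends only on the always-observed `x`, so the mechanism is MAR.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Full data before any values are removed: y depends on x.
n = 5000
full = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    full.append((x, x + random.gauss(0.0, 1.0)))

def make_missing(x, y):
    """Assumed MAR mechanism: y is often missing when the observed x is positive."""
    p_missing = 0.8 if x > 0 else 0.0
    return (x, None) if random.random() < p_missing else (x, y)

observed = [make_missing(x, y) for x, y in full]

# Complete data (CD) analysis: uses the full data before missingness.
cd_mean = mean(y for _, y in full)

# Complete case analysis: uses only subjects whose y is still observed.
cc_mean = mean(y for _, y in observed if y is not None)

print(f"CD mean of y: {cd_mean:.3f}")
print(f"Complete case mean of y: {cc_mean:.3f}")
```

In this setup the complete case estimate is pulled well below the CD estimate, because the retained subjects over-represent negative `x`. The CD result serves as the target that an imputation method aims to recover, and the complete case result shows what happens when incomplete records are simply dropped.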


Summary

Introduction

Distributed health data networks (DHDNs) leverage data from multiple sources or sites, such as electronic health records (EHRs) from multiple healthcare systems, and have drawn increasing interest in recent years because they do not require sharing of subject-level data and considerably lower the hurdles for collaboration between institutions. The current standard practice of de-identifying data by removing individual identifiers is inadequate for privacy protection in the era of big data: a large body of research has demonstrated that, given some background information about an individual, an adversary can learn sensitive information about the victim from “de-identified” data[2,3,4,5,6]. To address these challenges, DHDNs that can store and analyze EHR data from multiple sites without sharing individual-level data have drawn increasing interest in recent years[7,8]. In the presence of general missing data patterns, the multiple imputation (MI) by chained equations (MICE) method is widely adopted and has been shown to achieve superior performance in practice[19,20].

