Abstract

Electronic Health Records (EHRs) enable the sharing of patients’ medical data. Since EHRs include patients’ private data, access by researchers is restricted. Therefore k-anonymity is necessary to keep patients’ private data safe without damaging useful medical information. However, k-anonymity cannot prevent sensitive attribute disclosure. An alternative, l-diversity, has been proposed as a solution to this problem and is defined as: each Q-block (ie, each set of rows corresponding to the same value for identifiers) contains at least l well-represented values for each sensitive attribute. While l-diversity protects against sensitive attribute disclosure, it is limited in that it focuses only on diversifying sensitive attributes. The aim of the study is to develop a k-anonymity method that not only minimizes information loss but also achieves diversity of the sensitive attribute. This paper proposes a new privacy protection method that uses conditional entropy and mutual information. This method considers both information loss as well as diversity of sensitive attributes. Conditional entropy can measure the information loss by generalization, and mutual information is used to achieve the diversity of sensitive attributes. This method can offer appropriate Q-blocks for generalization. We used the adult database from the UCI Machine Learning Repository and found that the proposed method can greatly reduce information loss compared with a recent l-diversity study. It can also achieve the diversity of sensitive attributes by counting the number of Q-blocks that have leaks of diversity. This study provides a privacy protection method that can improve data utility and protect against sensitive attribute disclosure. The method is viable and should be of interest for further privacy protection in EHR applications.

Highlights

  • Society is experiencing exponential growth in the amount of health information

  • More of the same attribute values can be obtained by increasing the number of instances, in which case conditional entropy is zero, so total information loss is not increased

  • Total information loss of entropy l-diversity, t-closeness, and the proposed method is increased in response to the larger number of instances

Read more

Summary

Introduction

Society is experiencing exponential growth in the amount of health information. this information is distributed across multiple sites, held in a variety of paper and electronic formats, and represented as mixtures of narrative and structured data. Electronic Health Records (EHRs) have been introduced as a method for improving communication between health care providers and improving access to patient data This use of EHRs has enabled large and complicated databases of health records to be used for medical and other research. For patient health information to be de-identified, the Health Insurance Portability and Accountability Act (HIPAA) in the United States suggests the “Safe Harbor” technique, which requires 18 data elements to be removed [5,6]. Doing this can protect the confidentiality and privacy of research subjects. De-identification methods have tended to be quite faulty as the possibility remains of re-identifying a patient by linking or matching the data to other data or by looking at unique characteristics found in the released data

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call