Hiding sensitive information in eHealth datasets

Jimmy Ming-Tai Wu, Gautam Srivastava, Alireza Jolfaei, Philippe Fournier-Viger, Jerry Chun-Wei Lin

doi:10.1016/j.future.2020.11.026

Abstract

Privacy-preserving data mining (PPDM) has become a hot topic in both academic research and industry because it can discover implicit rules while hiding sensitive information through data sanitization. Many algorithms and heuristics based on evolutionary computation have been investigated for hiding sensitive information by deleting transactions, but to date these algorithms apply only a single, uniform threshold value throughout the sanitization process. This approach is not applicable in real-world situations, especially for eHealth medical datasets. For example, a patient can still be identified if more of his or her confidential information (such as symptoms) is exposed, which causes privacy threats and security leakage in medical applications. In this work, we investigate a novel methodology that sets threshold values according to the length of the sensitive patterns within a Genetic Algorithm (GA)-based framework. As the pattern length increases, a tighter threshold is applied, which gives better protection of sensitive information and prevents individual patients from being identified in eHealth datasets. Two GA-based models are developed for data sanitization using record deletion. Experiments are conducted to compare the designed methods with traditional Evolutionary Computation (EC)-based PPDM approaches, and the results show that the designed methods offer greater protection than previous methods in terms of side effects. The designed models are therefore effective at hiding sensitive information in medical settings and can be used in real-world scenarios.
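The idea of a threshold that tightens with pattern length can be illustrated with a minimal Python sketch. The abstract does not give the actual threshold function or the GA fitness function, so everything named here is an assumption made for illustration: the geometric rule in `length_threshold`, its `base` and `decay` parameters, and the helper `exposed_patterns`, which checks which sensitive patterns would remain at or above their threshold after a candidate set of records is deleted. A GA-based sanitizer could use such a check when scoring candidate deletions, but this is not the authors' implementation.

```python
from typing import FrozenSet, List, Set

Transaction = FrozenSet[str]


def length_threshold(base: float, length: int, decay: float = 0.8) -> float:
    """Hypothetical rule: the threshold shrinks geometrically as the
    pattern gets longer, so longer patterns are protected more tightly."""
    return base * decay ** (length - 1)


def support(pattern: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of the pattern."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if pattern <= t)
    return hits / len(transactions)


def exposed_patterns(
    transactions: List[Transaction],
    sensitive: List[FrozenSet[str]],
    deleted: Set[int],
    base: float,
    decay: float = 0.8,
) -> List[FrozenSet[str]]:
    """Sensitive patterns whose support is still at or above their
    length-dependent threshold after deleting the records in `deleted`."""
    kept = [t for i, t in enumerate(transactions) if i not in deleted]
    return [
        p
        for p in sensitive
        if support(p, kept) >= length_threshold(base, len(p), decay)
    ]


# Toy records: each transaction is one patient's set of symptoms.
records = [
    frozenset({"fever", "cough"}),
    frozenset({"fever", "cough", "rash"}),
    frozenset({"fever"}),
    frozenset({"cough"}),
    frozenset({"fever", "cough", "rash"}),
]
sensitive = [
    frozenset({"fever", "cough"}),
    frozenset({"fever", "cough", "rash"}),
]

before = exposed_patterns(records, sensitive, deleted=set(), base=0.5)
after = exposed_patterns(records, sensitive, deleted={1, 4}, base=0.5)
print(len(before), len(after))
```

In this toy example both sensitive patterns are exposed before any deletion, and deleting records 1 and 4 brings both below their thresholds. A real sanitizer would also have to weigh the side effects of each deletion, which this sketch does not model.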

Full Text