Abstract

With the development of big data and cloud computing technologies, the importance of pseudonym information has grown. However, tools for verifying whether a de-identification methodology has been applied correctly, so that both data confidentiality and usability are preserved, remain insufficient. This paper proposes a verification approach for de-identification techniques applied to personal healthcare information that considers both data confidentiality and usability. To represent de-identification techniques as a numeric dataset, data are generated and preprocessed from actual statistical data, personal information datasets, and de-identification datasets based on medical data. Five tree-based regression models (decision tree, random forest, gradient boosting machine, extreme gradient boosting, and light gradient boosting machine) are constructed using the de-identification dataset to discover nonlinear relationships between the dependent and independent variables in the numerical dataset. The most effective model is then selected on personal information data for which pseudonym processing is essential for data utilization. Finally, the Shapley additive explanation (SHAP), an explainable artificial intelligence technique, is applied to the most effective model to support the establishment of pseudonym processing policies and to present a machine-learning process that selects an appropriate de-identification methodology.
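
To make the role of the Shapley additive explanation more concrete, the following is a minimal sketch of how Shapley values attribute a single model prediction to its input features. It is not the paper's implementation: it computes exact Shapley values by enumerating feature coalitions in plain Python, rather than using a tree-specific SHAP algorithm, and the model `toy_model`, its three feature names, and the all-zero baseline are hypothetical stand-ins for a trained regressor and numerically encoded de-identification features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features, so this
    is only practical for a handful of features.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        # Take coalition features from x and all other features from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(features):
    # Hypothetical regressor (assumption, not from the paper): a score with
    # one interaction term between the first and third features.
    masking, generalization, suppression = features
    return 2.0 * masking + 1.0 * generalization + 0.5 * masking * suppression


x = [1.0, 2.0, 4.0]          # record to explain (illustrative values)
baseline = [0.0, 0.0, 0.0]   # reference record (illustrative values)
phi = shapley_values(toy_model, x, baseline)

for name, contribution in zip(["masking", "generalization", "suppression"], phi):
    print(f"{name}: {contribution:.3f}")
```

The attributions sum to the difference between the prediction for `x` and the prediction for the baseline, which is the property that allows per-feature contributions to be read as shares of a model's output. In this toy case the interaction term is split equally between the two features involved.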

Highlights

  • With the recent development of big data and cloud computing technologies, large volumes of data, including personal information, have been generated in digital environments [1]

  • This paper proposes a verification approach for de-identification methodologies that uses decision tree (DT)-based ensemble learning methods with Shapley additive explanation (SHAP) values

  • Training and testing of the machine-learning model were conducted in the following environment: an LG Electronics laptop with an Intel(R) Core(TM) i3-6100U CPU at 2.30

Summary

Introduction

With the recent development of big data and cloud computing technologies, large volumes of data, including personal information, have been generated in digital environments [1]. Failure to follow proper procedures when converting personal information into pseudonym information can cause serious privacy problems [5,7,8]. An appropriate de-identification process is therefore required to convert personal information into pseudonym information [3,9]. Various studies have sought to perform de-identification effectively or to develop open-source tools for de-identifying personal information, such as automatic retransmission exchange [10,11]. However, few methods evaluate whether de-identified data (including alias processing) can be properly identified. Technology for producing pseudonym information must be adapted to the intended purposes.
