De-identifying free text of Japanese electronic health records

Kohei Kajiyama, Takashi Okumura, Yoshinobu Kano, Hiromasa Horiguchi, Mizuki Morita

doi:10.1186/s13326-020-00227-9

Abstract

Background: Recently, more electronic data sources have become available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset.

Results: Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the de-identification performance of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold standard tags for de-identification were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate our three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, for evaluations of the MedNLP dataset, a dummy EHR dataset of fictitious records written by a medical doctor, and a Pathology Report dataset. Our LSTM-based method performed best on every dataset except MedNLP, for which the rule-based method was best. Even on the MedNLP dataset, the LSTM-based method achieved a good score of 83.07 points, which is 1.16 points below the best score obtained using the rule-based method. These results suggest that LSTM adapted well to the different characteristics of our datasets. Our LSTM-based method outperformed our CRF-based method by 7.41 F1 points when applied to our Pathology Report dataset. This report is the first study to apply this LSTM-based method to any de-identification task for Japanese EHRs.

Conclusions: Our LSTM-based machine learning method was able to extract named entities to be de-identified with better performance, in general, than that of our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence. Our future work will specifically examine the combination of LSTM and rule-based methods to achieve better performance. Our currently achieved level of performance is sufficiently higher than that of publicly available Japanese de-identification tools. Therefore, our system will be applied to actual de-identification tasks in hospitals.
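As a rough illustration of what a rule-based de-identifier for the tag types above might look like, the following Python sketch masks age, sex, and time expressions with regular expressions. It is not the authors' system: the patterns, the tag names as dictionary keys, the `find_entities` and `mask` helpers, and the sample sentence are all hypothetical and chosen only for illustration. Hospital and person names are omitted because they generally need dictionaries or learned models rather than simple patterns.

```python
import re

# Hypothetical patterns for three of the five tag types named in the abstract.
# These are illustrative assumptions, not the rules used in the paper.
PATTERNS = {
    "age": re.compile(r"\d+歳"),                    # e.g. 65歳
    "time": re.compile(r"\d{4}年\d{1,2}月\d{1,2}日"),  # e.g. 2020年9月21日
    "sex": re.compile(r"男性|女性"),                 # male / female
}


def find_entities(text):
    """Return sorted (start, end, tag) spans for every pattern match."""
    spans = []
    for tag, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), tag))
    return sorted(spans)


def mask(text):
    """Replace each matched span with a placeholder such as [age]."""
    pieces = []
    pos = 0
    for start, end, tag in find_entities(text):
        if start < pos:
            continue  # skip a span that overlaps one already masked
        pieces.append(text[pos:start])
        pieces.append(f"[{tag}]")
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)


# Fictitious sentence: "The patient is a 65-year-old man. He visited on 2020-09-21."
sample = "患者は65歳の男性。2020年9月21日に来院した。"
print(mask(sample))  # 患者は[age]の[sex]。[time]に来院した。
```

Fixed patterns like these only catch the surface forms they were written for, which is one reason a learned tagger can adapt better when the writing style changes between datasets.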

Highlights

More electronic data sources are becoming available in the healthcare domain
We implemented three de-identification methods for Japanese electronic health records (EHRs) and applied them to three datasets: two dummy EHR sources and one real Pathology Report dataset
Our best F1-scores over all the tag types are 84.23, 68.19 (LSTM), and 81.67 (LSTM) points, respectively, for the MedNLP dataset, the dummy EHR dataset, and the Pathology Report dataset
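For readers unfamiliar with the metric, the sketch below shows one common way to compute precision, recall, and F1 over tagged spans. The matching criterion used in the paper is not stated on this page, so exact span-and-tag matching is an assumption here, and the `f1_score` helper and the example spans are hypothetical.

```python
def f1_score(gold, predicted):
    """Compute precision, recall, and F1 from exact (start, end, tag) matches.

    Exact matching is an assumption; the paper may score spans differently.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: the predicted time span starts two characters late,
# so it does not count as a match under the exact-match assumption.
gold = [(3, 6, "age"), (7, 9, "sex"), (10, 20, "time")]
pred = [(3, 6, "age"), (7, 9, "sex"), (12, 20, "time")]

precision, recall, f1 = f1_score(gold, pred)
print(f"{100 * f1:.2f}")  # 66.67
```

Scores on this page are reported in points, that is, F1 multiplied by 100, so the 1.16-point gap on MedNLP is simply 84.23 minus 83.07.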

Summary

Introduction

More electronic data sources are becoming available in the healthcare domain. Utilization of [...]

Kajiyama et al. Journal of Biomedical Semantics (2020) 11:11

[...] further restricts the use of personal identification codes, including individual numbers (e.g., health insurance card numbers, driver’s license card numbers, and governmental personnel numbers), biometric information (e.g., fingerprints, DNA, voice, and appearances), and information related to disability. This legislation can be compared with the Health Insurance Portability and Accountability Act (HIPAA) [2] of the United States, in that the 2017 Japanese Act includes additional codes, with abstract specifications such as “you should strive not to discriminate or impose improper burdens,” and excludes birth dates and criminal histories, as stipulated by HIPAA. Although de-identification of unstructured data in EHRs is necessary, it is virtually impossible to de-identify the huge number of documents manually.

Journal: Journal of Biomedical Semantics
Publication Date: Sep 21, 2020
Citations: 3
License type: open-access
