Abstract

The increasing reliance on electronic health records (EHRs) in areas such as medical research should be accompanied by ample safeguards for patient privacy. These records often constitute big data, and because a significant portion is stored as free (unstructured) text, we examined relevant work on automated free text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches. Both methods involve machine learning and are widely used for the removal of protected health information (PHI) from free text. Our survey produced several informative findings. First, RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models and other systems, namely rule-based algorithms. Second, hybrid or ensemble systems containing joint LSTM-CRF models showed no advantage over individual LSTM and CRF models. Third, overfitting may be an issue when customized de-identification datasets are used during model training. Finally, statistical validation of performance scores and diversity during experimentation were largely ignored. In our comprehensive survey, we also identify major research gaps that should be considered for future work.

Highlights

  • As the use and volume of medical records continue to grow rapidly in various areas, including research, there is a growing need to safeguard patient privacy for ethical and legal reasons [1]

  • The proliferation of electronic health records (EHRs) in areas such as medical research should be matched by adequate protections for patient privacy

  • Since a considerable number of medical records are stored as free text, we decided to survey work on automated free text de-identification


Summary

Introduction

As the use and volume of medical records continue to grow rapidly in various areas, including research, there is a growing need to safeguard patient privacy for ethical and legal reasons [1].

RNNs have cyclic (feedback) connections that update the current state based on previous states and current inputs. This makes RNNs more efficient in sequence modeling tasks.

A lot of research work in this area is centered on generative models, such as the hidden Markov model (HMM) [26], which characterize a joint probability distribution over observed input features x and corresponding annotated outputs y. Modeling this distribution means that all possible values of x must be accounted for, an often intractable operation due to the high dimensionality of x.

If a proposal integrates both RNN and CRF algorithms into one model, the work is included both in the “Conditional random fields: methods for medical free text de-identification” section, which discusses de-identification of medical free text with CRFs, and in this section.

Dernoncourt et al. 2017 [33]; Jiang et al. 2017 [38]; Kajiyama et al. 2018 [39]; Kim et al. 2018 [40]; Lee et al. 2016 [41]; Lee et al. 2019 [42]; Liu et al. 2017 [43]; Madan et al. 2018 [44]; Richter et al. 2019 [45]; Shweta et al. 2016 [46]; Trienes et al. 2020 [47]; Yang et al. 2019 [48]
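
The recurrent state update mentioned in this summary can be illustrated with a minimal sketch. It assumes a plain Elman-style recurrence, h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), which is simpler than the LSTM cells used in the surveyed systems. The function names `rnn_step` and `run_rnn` and the toy weights in the usage note are hypothetical and chosen only for illustration; they do not come from any of the surveyed works.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One Elman-style update (an assumed, simplified recurrence):
    h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)."""
    h_new = []
    for i in range(len(b_h)):
        total = b_h[i]
        # Contribution of the current input x_t.
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        # Contribution of the previous state h_{t-1} (the feedback connection).
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, h0, W_xh, W_hh, b_h):
    """Apply rnn_step over a sequence and return the state after each input."""
    states = []
    h = h0
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states
```

For example, feeding the same input vector twice, as in `run_rnn([[1.0, 0.0], [1.0, 0.0]], [0.0, 0.0], W_xh, W_hh, b_h)` with a nonzero `W_hh`, yields two different states, because the second state also depends on the first. In a de-identification setting each input would be a token representation and each state would feed a per-token PHI label classifier, but that part is omitted here.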

Findings
[Table columns: Records; Tokens; Tokens per record; PHI tags; PHI per record]
Conclusion