Survey on RNN and CRF models for de-identification of medical free text

Joffrey L Leevy,Taghi M Khoshgoftaar,Flavio Villanustre

doi:10.1186/s40537-020-00351-4

Journal: Journal of Big Data
Publication Date: Sep 4, 2020
Citations: 27
License type: open-access

Affiliation: Florida Atlantic University

Abstract

The increasing reliance on electronic health records (EHRs) in areas such as medical research should be matched by ample safeguards for patient privacy. These records often qualify as big data, and given that a significant portion is stored as free (unstructured) text, we decided to examine relevant work on automated free text de-identification with recurrent neural network (RNN) and conditional random field (CRF) approaches. Both methods involve machine learning and are widely used for the removal of protected health information (PHI) from free text. Our survey produced several informative findings. Firstly, RNN models, particularly long short-term memory (LSTM) algorithms, generally outperformed CRF models and also other systems, namely rule-based algorithms. Secondly, hybrid or ensemble systems containing joint LSTM-CRF models showed no advantage over individual LSTM and CRF models. Thirdly, overfitting may be an issue when customized de-identification datasets are used during model training. Finally, statistical validation of performance scores and diversity during experimentation were largely ignored. In our comprehensive survey, we also identify major research gaps that should be considered for future work.

Highlights

As the use and volume of medical records continue to grow rapidly in various areas, including research, there is a growing need to safeguard patient privacy for ethical and legal reasons [1].
The proliferation of electronic health records (EHRs) in areas such as medical research should be matched by adequate protections for patient privacy.
Since a considerable number of medical records are stored as free text, we decided to survey automated free text de-identification.

Summary

Introduction

As the use and volume of medical records continue to grow rapidly in various areas, including research, there is a growing need to safeguard patient privacy for ethical and legal reasons [1]. RNNs have cyclic or feedback connections that facilitate updates to their current state based on previous states and current inputs. This makes RNNs efficient in sequence modeling tasks. A lot of research work in this area is centered on generative models, such as the hidden Markov model (HMM) [26], which characterize a joint probability distribution over observed input features x and corresponding annotated outputs y. Modeling this distribution means that all possible values of x must be accounted for, an often intractable operation due to the high dimensionality of x. If the proposal integrates both RNN and CRF algorithms into one model, the work is included both in the “Conditional random fields: methods for medical free text de-identification” section, which discusses de-identification of medical free text with CRFs, and this section.

Dernoncourt et al. 2017 [33]; Jiang et al. 2017 [38]; Kajiyama et al. 2018 [39]; Kim et al. 2018 [40]; Lee et al. 2016 [41]; Lee et al. 2019 [42]; Liu et al. 2017 [43]; Madan et al. 2018 [44]; Richter et al. 2019 [45]; Shweta et al. 2016 [46]; Trienes et al. 2020 [47]; Yang et al. 2019 [48]
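The recurrent state update mentioned in the introduction can be illustrated with a minimal sketch. It assumes a plain Elman-style cell with a tanh activation, h_t = tanh(W_xh x_t + W_hh h_(t-1) + b); the function names, toy weights, and two-dimensional inputs are hypothetical choices for illustration and are not taken from any of the surveyed systems, which typically use LSTM cells instead.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One recurrent update: the new state depends on the current input
    x and on the previous state h_prev (assumed Elman-style cell)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, W_xh, W_hh, b):
    """Feed a sequence through the cell, starting from a zero state,
    and return the hidden state after every time step."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states


# Toy weights and a three-token sequence (illustrative values only).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

states = run_rnn(tokens, W_xh, W_hh, b)
```

In this toy run the first and third tokens are identical, yet their hidden states differ because the third state also carries information from the two earlier steps. That dependence on context is what a sequence labeler relies on when deciding whether a token is PHI.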

Findings

[Table columns: Records | Tokens | Tokens per record | PHI tags | PHI per record]

Conclusion
