Combining contextualized word representation and sub-document level analysis through Bi-LSTM+CRF architecture for clinical de-identification

Rosario Catelli, Valentina Casola, Giuseppe De Pietro, Hamido Fujita, Massimo Esposito

doi:10.1016/j.knosys.2020.106649

Abstract

Clinical de-identification aims to identify Protected Health Information (PHI) in clinical data, enabling data sharing and publication. The first automatic de-identification systems were based on rules or on machine learning methods, and were limited by language changes, lack of context awareness and time-consuming feature engineering. Newer deep learning techniques for sequence labeling have shown better results while reducing feature engineering effort and relying on vector-space word representations. However, they cannot jointly represent the polysemic, context-dependent nature of words and the morpho-syntactic variations characteristic of handwriting. To address these limitations, a new de-identification approach based on deep learning techniques for Named Entity Recognition is proposed, whose key factors are: (i) a Bidirectional Long Short-Term Memory + Conditional Random Field (Bi-LSTM+CRF) architecture for sequence labeling that takes advantage of the widest possible representation context; (ii) a contextualized language model, working at character level, to capture the polysemy of words and handle the morpho-syntactic variations typical of handwritten notes; (iii) multiple stacked word representations to better capture latent syntactic and semantic similarities. The approach was tested on the official Informatics for Integrating Biology & the Bedside (i2b2) 2014 de-identification dataset, showing performance similar to or higher than the state of the art for both category and binary recognition, without any feature engineering or handcrafted rules. The experiments demonstrate the effectiveness of the proposed approach, in particular for category-level recognition, which is essential to correctly replace entities with surrogates for anonymization purposes.
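The distinction between category and binary recognition can be illustrated with a small scoring sketch. This is not the official i2b2 evaluation, which uses its own scripts and matching criteria. It is a simplified token-level illustration, and the function name `prf`, the label strings and the toy sequences are all assumptions introduced here. In binary mode any PHI label counts as a match for any other PHI label, whereas in category mode the predicted category must also agree with the gold one.

```python
from typing import List, Tuple


def prf(gold: List[str], pred: List[str], binary: bool = False) -> Tuple[float, float, float]:
    """Token-level precision, recall and F1 over PHI tokens (hypothetical helper).

    Labels are PHI category names, with "O" marking non-PHI tokens.
    binary=False: a prediction is correct only if the category matches.
    binary=True:  every PHI category is collapsed to a single "PHI" label first.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if binary:
            g = "O" if g == "O" else "PHI"
            p = "O" if p == "O" else "PHI"
        if p != "O" and p == g:
            tp += 1          # PHI token predicted with the right label
        else:
            if p != "O":
                fp += 1      # predicted PHI, but wrong label or not PHI at all
            if g != "O":
                fn += 1      # gold PHI token missed or mislabeled

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy sequences (illustrative labels only, not taken from the dataset).
gold = ["O", "NAME", "NAME", "O", "DATE", "O", "LOCATION"]
pred = ["O", "NAME", "DATE", "O", "DATE", "O", "O"]

category_scores = prf(gold, pred, binary=False)
binary_scores = prf(gold, pred, binary=True)
print("category:", category_scores)
print("binary:  ", binary_scores)
```

In this toy case the second NAME token is tagged as DATE: it is still detected as PHI, so the binary score is unaffected, but the category score drops. A system replacing entities with surrogates would insert a date where a name belongs, which is why category-level performance matters for anonymization.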

Journal: Knowledge-Based Systems
Publication Date: Dec 24, 2020
Citations: 41
