A Novel COVID-19 Data Set and an Effective Deep Learning Approach for the De-Identification of Italian Medical Records.

Rosario Catelli, Francesco Gargiulo, Valentina Casola, Giuseppe De Pietro, Massimo Esposito, Hamido Fujita

doi:10.1109/access.2021.3054479

Abstract

In recent years, the need to de-identify privacy-sensitive information in Electronic Health Records (EHRs) has become increasingly pressing, since de-identification allows their content to be shared and published in accordance with the restrictions imposed by national and supranational privacy authorities. In Natural Language Processing (NLP), several deep learning techniques for Named Entity Recognition (NER) have been applied to this problem, significantly improving the identification of sensitive information in EHRs written in English. However, the lack of data sets in other languages has strongly limited their applicability and performance evaluation. To address this gap, a new Italian de-identification data set was developed in this work, starting from the 115 COVID-19 EHRs provided by the Italian Society of Radiology (SIRM): 65 were used for training and development, and the remaining 50 for testing. The data set was labelled following the guidelines of the i2b2 2014 de-identification track. As an additional contribution, a stacked word representation not previously tested in the Italian clinical de-identification scenario was combined with the best-performing Bi-LSTM + CRF sequence labeling architecture. The representation is based both on a contextualized language model, to handle word polysemy and morpho-syntactic variation, and on sub-word embeddings, to better capture latent syntactic and semantic similarities. Finally, the proposed model was compared with other cutting-edge approaches and achieved the best performance, supporting the effectiveness of the proposed approach.

Highlights

In recent years, the availability of textual clinical data in electronic form, known as Electronic Health Records (EHRs), from which further information can be extracted to manage various critical health situations, has become increasingly important.
Qualitative analysis: the Bidirectional Long Short-Term Memory (Bi-LSTM) + Conditional Random Field (CRF) model with the proposed stacked embedding, made of FastText plus Flair, works at both the sub-word level and the character level while exploiting context. The results show that this stacked embedding is effective in improving the ability to detect and classify entities.
In this study, a novel Italian data set was proposed for a challenging Named Entity Recognition (NER) task, namely clinical de-identification.
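
The idea of a stacked embedding can be illustrated with a minimal sketch. It assumes that "stacking" means concatenating, token by token, the vectors produced by several embedders. The two embedders below (`subword_embedder` and `context_embedder`) are hypothetical toy stand-ins written for this illustration; they are not FastText or Flair, and they are not the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
# An embedder maps a whole sentence (a list of tokens) to one vector per token.
Embedder = Callable[[Sequence[str]], List[Vector]]


def subword_embedder(tokens: Sequence[str], dim: int = 4, n: int = 3) -> List[Vector]:
    """Toy stand-in for a sub-word model: bucket the character n-grams of each token."""
    vectors = []
    for token in tokens:
        padded = f"<{token.lower()}>"
        grams = [padded[i:i + n] for i in range(max(1, len(padded) - n + 1))]
        vec = [0.0] * dim
        for gram in grams:
            # Hypothetical bucketing rule, chosen only to be deterministic.
            vec[sum(map(ord, gram)) % dim] += 1.0 / len(grams)
        vectors.append(vec)
    return vectors


def context_embedder(tokens: Sequence[str]) -> List[Vector]:
    """Toy stand-in for a contextual model: the vector depends on the neighbours."""
    vectors = []
    for i in range(len(tokens)):
        left = tokens[i - 1] if i > 0 else ""
        right = tokens[i + 1] if i + 1 < len(tokens) else ""
        vectors.append([len(left) / 10.0, len(right) / 10.0])
    return vectors


def stack(embedders: Sequence[Embedder], tokens: Sequence[str]) -> List[Vector]:
    """Concatenate, for every token, the vectors produced by each embedder."""
    per_model = [embed(tokens) for embed in embedders]
    stacked = []
    for token_vectors in zip(*per_model):
        merged: Vector = []
        for vec in token_vectors:
            merged.extend(vec)
        stacked.append(merged)
    return stacked
```

In this sketch, the stacked vector of a token has as many dimensions as the two component vectors combined, and the same surface form can receive different stacked vectors in different sentences because the contextual part changes. In a real pipeline, the stacked vectors would be the input to the sequence labeler.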

Summary

Introduction

The availability of textual clinical data in electronic form, known as Electronic Health Records (EHRs), from which further information can be extracted to manage various critical health situations, has become increasingly important. An item of Protected Health Information (PHI) can be treated as a named entity. Recognizing such entities is the task of NER, which is called clinical NER when applied to medical records in the form of unstructured text. To use the data contained in these records, it is necessary to identify the PHI and replace them with valid surrogates, a process called anonymisation [8]. For this reason, it is important to recognize the type to which each entity belongs, and it would be more correct to refer to Named Entity Recognition and Classification (NERC). Promising deep learning systems for this task have started to be applied to languages other than English.
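
The replacement step described above can be sketched as follows, assuming the NERC model outputs one BIO tag per token. The helper names (`bio_to_spans`, `deidentify`), the label names `PATIENT` and `DATE`, and the example sentence are illustrative assumptions; the sentence is invented and does not come from the data set.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def bio_to_spans(tags: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, type) spans from BIO tags; end is exclusive."""
    spans = []
    start: Optional[int] = None
    kind: Optional[str] = None
    # A trailing "O" closes any span still open at the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, kind))
            start, kind = None, None
        # Lenient choice: an "I-" tag with no open span starts a new one.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return spans


def deidentify(
    tokens: Sequence[str],
    tags: Sequence[str],
    surrogates: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Replace every tagged PHI span with a surrogate chosen by entity type."""
    surrogates = surrogates or {}
    output: List[str] = []
    cursor = 0
    for start, end, kind in bio_to_spans(tags):
        output.extend(tokens[cursor:start])
        # Fall back to a bracketed type placeholder when no surrogate is given.
        output.append(surrogates.get(kind, f"[{kind}]"))
        cursor = end
    output.extend(tokens[cursor:])
    return output


# Invented example sentence, not taken from the SIRM records.
tokens = ["Il", "paziente", "Mario", "Rossi", "ricoverato", "il", "12/03/2020"]
tags = ["O", "O", "B-PATIENT", "I-PATIENT", "O", "O", "B-DATE"]
print(" ".join(deidentify(tokens, tags, {"PATIENT": "Luca Bianchi"})))
```

The sketch shows why the entity type matters: the surrogate is looked up by type, so a name is replaced with a name-like value while an unmapped type falls back to a generic placeholder.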

Methods

Results

Conclusion

Journal: IEEE access : practical innovations, open solutions	Publication Date: Jan 1, 2021
Citations: 74	License type: CC BY 4.0

Similar Papers

Chinese medical entity recognition based on the dual-branch TENER model.
Hui Peng ... Xiaohui Qin
BMC Medical Informatics and Decision Making | VOL. 23
24 Jul 2023

Med7: A transferable clinical natural language processing model for electronic health records
Andrey Kormilitzin ... Alejo Nevado-Holgado
Artificial Intelligence in Medicine | VOL. 118
18 May 2021

An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation.
Peng Wang ... Shaopei Long
JMIR medical informatics | VOL. 10
30 Aug 2022

Applications of Natural Language Processing in Clinical Research and Practice
Yanshan Wang ... Ahmad Tafti
01 Jan 2019
