De-identifying Spanish medical texts - named entity recognition applied to radiology reports

Irene Pérez-Díez, Adolfo López-Cerdán, María De La Iglesia-Vayá, Raúl Pérez-Moraga, Jose-Maria Salinas-Serrano

doi:10.1186/s13326-021-00236-2

Abstract

Background: Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies currently exist for the English language, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts, translatable to other languages.

Results: We tested four neural networks on our radiology reports dataset, achieving a recall of 97.18% of the identifying entities. In addition, we developed a randomization algorithm to substitute the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were tested with the MEDDOCAN challenge dataset of electronic health records as an external test, achieving a recall of 69.18%.

Conclusions: The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it could be easily extended to other languages and medical texts, such as electronic health records.
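The randomization step described in the abstract can be illustrated with a minimal sketch. It assumes that entities arrive as (start, end, category) character spans and that each category has a pool of surrogate values; the names SURROGATES and randomize_entities, the category labels, and the pool contents are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical surrogate pools; a real system would draw from much larger lists.
SURROGATES = {
    "NAME": ["María García", "Juan Martínez", "Lucía Fernández"],
    "LOCATION": ["Valencia", "Sevilla", "Bilbao"],
}


def randomize_entities(text, entities, rng=None):
    """Replace each detected span with a random surrogate of the same category.

    `entities` is a list of non-overlapping (start, end, category) tuples,
    where start and end are character offsets into `text`.
    """
    rng = rng or random.Random()
    pieces = []
    cursor = 0
    for start, end, category in sorted(entities):
        # Copy the untouched text that precedes this entity.
        pieces.append(text[cursor:start])
        pool = SURROGATES.get(category)
        # Fall back to a category tag when no surrogate pool is defined.
        pieces.append(rng.choice(pool) if pool else f"[{category}]")
        cursor = end
    # Copy whatever follows the last entity.
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    report = "Paciente Ana Ruiz, atendida en Alicante."
    name_start = report.index("Ana Ruiz")
    place_start = report.index("Alicante")
    spans = [
        (name_start, name_start + len("Ana Ruiz"), "NAME"),
        (place_start, place_start + len("Alicante"), "LOCATION"),
    ]
    print(randomize_entities(report, spans, random.Random(0)))
```

Because a surrogate comes from the same category as the entity it replaces, the output keeps the shape of a real report, which is the property the abstract attributes to the randomization algorithm.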

Highlights

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers
Given that there is no specific guidance in the Spanish legal system on what information has to be removed to de-identify medical texts, we decided to assess the presence in our corpus of the Protected Health Information (PHI) categories defined by the Health Insurance Portability and Accountability Act (HIPAA) in the United States of America [31]
Considering that recall measures a model's ability to avoid leaking sensitive information, we propose the LSTM-LSTM-CRF architecture (long short-term memory units, LSTM; conditional random fields, CRF) with Exponential Moving Average (EMA) as the best neural network to address a de-identification task based on Named Entity Recognition (NER)
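To make the role of recall concrete: recall is the fraction of true identifying entities that the model detects, TP / (TP + FN), so every false negative is a piece of sensitive information left in the text. The sketch below assumes exact-match scoring over (start, end, category) spans; the function name entity_recall is hypothetical, and the paper may score matches differently (for example, at the token level).

```python
def entity_recall(gold, predicted):
    """Recall over exact span matches: TP / (TP + FN).

    `gold` and `predicted` are iterables of (start, end, category) tuples.
    """
    gold_set = set(gold)
    if not gold_set:
        # Nothing to detect, so nothing can leak.
        return 1.0
    true_positives = len(gold_set & set(predicted))
    return true_positives / len(gold_set)


if __name__ == "__main__":
    gold = [(0, 8, "NAME"), (20, 28, "LOCATION"), (30, 40, "DATE"), (45, 50, "ID")]
    # Three gold entities found, one missed, plus one spurious detection.
    predicted = [(0, 8, "NAME"), (20, 28, "LOCATION"), (30, 40, "DATE"), (60, 65, "NAME")]
    print(entity_recall(gold, predicted))
```

The spurious detection does not lower recall; it would lower precision instead. This is why a de-identification task that prioritizes avoiding leakage selects its model on recall.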

Summary

Introduction

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information from both patients and medical staff. Although several anonymization strategies currently exist for the English language, they are language-dependent. We introduce a named entity recognition strategy for Spanish medical texts, translatable to other languages. Data from radiology reports, electronic health records and other medical texts such as clinical trial protocols are being used for research purposes [1, 2]. Researchers and patients can greatly benefit from these datasets. In Spain, the Organic Law 3/2018 [4] establishes the legal framework for data pro-

Objectives

Methods

Results

Discussion

Conclusion