Automatic Correction of Real-Word Errors in Spanish Clinical Texts.

Daniel Bravo-Candel, Jésica López-Hernández, José Antonio García-Díaz, Fernando Molina-Molina, Francisco García-Sánchez

doi:10.3390/s21092893

Open Access

https://doi.org/10.3390/s21092893

Journal: Sensors (Basel, Switzerland)
Publication Date: Apr 21, 2021
Citations: 9
License type: CC BY 4.0

Affiliation: University of Murcia

Abstract

Real-word errors are misspellings that result in actual dictionary terms, so they can only be detected by taking context into account. Traditional methods for detecting and correcting such errors are mostly based on counting the frequency of short word sequences in a corpus and then computing the probability that a word is a real-word error. State-of-the-art approaches, by contrast, use deep learning models that learn context by extracting semantic features from text. In this work, a deep learning model was implemented for correcting real-word errors in clinical text. Specifically, a Seq2seq Neural Machine Translation model mapped erroneous sentences to their correct counterparts. To that end, different types of errors were generated in correct sentences by using rules. Different Seq2seq models were trained and evaluated on two corpora: the Wikicorpus and a collection of three clinical datasets. The medicine corpus was much smaller than the Wikicorpus due to privacy issues when dealing with patient information. Moreover, GloVe and Word2Vec pretrained word embeddings were used to study their performance. Despite the medicine corpus being much smaller, Seq2seq models trained on it performed better than those trained on the Wikicorpus. Nevertheless, a larger amount of clinical text is required to improve the results.
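
As a rough illustration of the rule-based error generation mentioned in the abstract, the sketch below builds (erroneous, correct) sentence pairs of the kind a Seq2seq model could be trained on. It is not the authors' implementation: the confusion pairs, the `rate` parameter, and the names `corrupt` and `make_pairs` are assumptions chosen for illustration.

```python
import random

# Hypothetical confusion pairs (an assumption, not the paper's rule set).
# Every entry is a valid Spanish word, so swapping one for the other
# yields a real-word error rather than a non-existent word.
CONFUSION_PAIRS = {
    "el": "él",
    "él": "el",
    "esta": "está",
    "está": "esta",
    "si": "sí",
    "sí": "si",
}


def corrupt(sentence, rng, rate=0.5):
    """Return a copy of `sentence` where each confusable token is
    swapped with probability `rate`."""
    out = []
    for token in sentence.split():
        if token in CONFUSION_PAIRS and rng.random() < rate:
            out.append(CONFUSION_PAIRS[token])
        else:
            out.append(token)
    return " ".join(out)


def make_pairs(sentences, seed=0, rate=0.5):
    """Build (erroneous, correct) pairs: the source side is the corrupted
    sentence and the target side is the original."""
    rng = random.Random(seed)
    return [(corrupt(s, rng, rate), s) for s in sentences]


if __name__ == "__main__":
    for noisy, clean in make_pairs(["el paciente está estable"], rate=1.0):
        print(noisy, "->", clean)
```

With `rate=1.0` every confusable token is swapped, which is convenient for checking the rules; a lower rate would leave some sentences untouched, so a model would also see correct input.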

Highlights

Clinical notes often contain spelling errors due to time and efficiency pressure
The seq2seq model was evaluated with the medicine corpus and the Wikicorpus
Real-word errors can affect the performance of automatic text processing tools, including decision support systems and recommender systems

Summary

Introduction

Clinical notes often contain spelling errors due to time and efficiency pressure. Along with abbreviations, punctuation errors, and other types of noise, misspellings hinder text processing tasks for knowledge extraction, such as term disambiguation and named entity recognition. Spelling detection and correction are considered from the perspective of non-word errors and real-word errors [3]. The former concerns misspellings that result in non-existent words, e.g., ‘graffe’ for ‘giraffe’. These errors are usually detected by looking them up in a dictionary and corrected by calculating the edit distance to similar words [4]. Grammatical errors are considered real-word errors. In this case, dictionary lookup is not a valid approach, and real words must be analyzed with respect to their context.
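
The dictionary lookup and edit-distance strategy for non-word errors can be sketched as follows. This is a minimal illustration rather than the method of the cited works: the toy vocabulary `VOCAB` and the helper names `edit_distance` and `correct_non_word` are assumptions, and Levenshtein distance is used as one common choice of edit distance.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed with
    dynamic programming over a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


# Toy vocabulary (an assumption for illustration only).
VOCAB = {"giraffe", "grape", "graph", "form", "from"}


def correct_non_word(word, vocab=VOCAB):
    """Return `word` unchanged if it is in the vocabulary; otherwise
    return the vocabulary entry with the smallest edit distance."""
    if word in vocab:
        # A real word passes the lookup even if it is wrong in context.
        return word
    # Sorting makes tie-breaking deterministic.
    return min(sorted(vocab), key=lambda w: edit_distance(word, w))


if __name__ == "__main__":
    print(correct_non_word("graffe"))
    print(correct_non_word("form"))
```

Here ‘graffe’ is absent from the vocabulary and is mapped to its nearest entry, whereas ‘form’ typed in place of ‘from’ passes the lookup untouched, which is why real-word errors call for context-based analysis.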

Results

Discussion

Conclusion