Abstract

Spelling mistakes are a common issue in the Arabic language; several techniques have been used to address this issue, such as confusion matrices, language models, and neural networks. In recent years, a neural network called the Transformer was introduced as a machine translation model. Since then, the Transformer and its variants have become a very popular solution for most Natural Language Processing (NLP) tasks. In this paper, we aim to use the Transformer to correct four types of Arabic soft spelling mistakes, namely, confusion among the various shapes of hamza, the shapes of alef at the end of a word, insertion and omission of alef after waw aljamaea, and confusion among teh, teh marbuta, and heh at the end of a word. We used artificial errors for training and evaluation; these errors were generated using an approach called stochastic error injection. The best model we trained was able to correct 97.37% of the artificial errors injected into the test set and achieved a character error rate (CER) of 0.86% on a set containing real soft spelling mistakes.
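The idea of stochastic error injection can be sketched as follows: each character that belongs to a confusable class is replaced by one of its confusable alternatives with some probability. The confusion sets and the probability below are illustrative assumptions, not the paper's exact configuration, and the sketch covers only substitution errors (the alef insertion/omission case after waw aljamaea would need insert/delete operations as well).

```python
import random

# Hypothetical confusion sets for some of the soft-error classes named in
# the abstract (hamza shapes; alef maqsura vs. yeh; teh marbuta/heh/teh).
# The paper's exact sets and error rates are an assumption here.
CONFUSION_SETS = {
    "أ": ["ا", "إ", "آ"],   # shapes of hamza
    "ى": ["ي"],              # alef maqsura vs. yeh at word end
    "ة": ["ه", "ت"],        # teh marbuta / heh / teh at word end
}

def inject_errors(text, p=0.1, rng=None):
    """Stochastic error injection (substitutions only): each eligible
    character is replaced with a confusable alternative with probability p."""
    rng = rng or random.Random()
    out = []
    for ch in text:
        if ch in CONFUSION_SETS and rng.random() < p:
            out.append(rng.choice(CONFUSION_SETS[ch]))
        else:
            out.append(ch)
    return "".join(out)
```

Corrupted/clean sentence pairs produced this way can then serve as training data for a sequence-to-sequence correction model.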
