Abstract
Automatic text diacritization remains a major challenge in Arabic natural language processing (NLP), and progress has been slow compared with other language processing tasks. This work proposes the first transformer-based model designed solely for automatic diacritization of Arabic text. By exploiting the attention mechanism, the system captures more of the inherent patterns of Arabic and outperforms both rule-based alternatives and other neural network techniques. The model trained on the Clean-50 dataset achieved a diacritic error rate (DER) of 2.03%, while the model trained on the Clean-400 dataset achieved a DER of 1.37%. Compared with state-of-the-art results, the improvement on Clean-50 is marginal, but on the larger Clean-400 dataset it is notable, indicating that the approach can deliver more accurate results for applications requiring precise diacritical marks when larger datasets are available. When given extended input text processed with overlapping windows, the method improves further, reaching a DER of 1.21% on Clean-400.
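Since the results above are reported as DER, a minimal sketch of how such a rate can be computed may help make the metric concrete. It is an illustration under stated assumptions, not the evaluation script used in this work: the function names are hypothetical, only Unicode letters are counted, the diacritic set is limited to U+064B through U+0652 plus the superscript alef (U+0670), and the reference and prediction are assumed to share the same undiacritized text. Published evaluations often differ in details such as whether case endings or undiacritized characters are counted.

```python
# Assumed diacritic inventory: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_diacritics(text):
    """Split text into (base_char, marks) pairs, attaching each diacritic to the preceding base."""
    pairs = []
    for ch in text:
        if ch in DIACRITICS:
            if pairs:  # a diacritic with no preceding base character is ignored
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def diacritic_error_rate(reference, prediction):
    """Fraction of letters whose predicted diacritics differ from the reference."""
    ref = split_diacritics(reference)
    hyp = split_diacritics(prediction)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("reference and prediction must share the same base characters")
    total = 0
    errors = 0
    for (base, ref_marks), (_, hyp_marks) in zip(ref, hyp):
        if not base.isalpha():  # assumption: spaces, digits and punctuation are not scored
            continue
        total += 1
        # Compare marks order-insensitively (e.g. shadda + vowel may appear in either order).
        if sorted(ref_marks) != sorted(hyp_marks):
            errors += 1
    return errors / total if total else 0.0


if __name__ == "__main__":
    gold = "\u0643\u064E\u062A\u064E\u0628\u064E"       # kaf, ta, ba, each with fatha
    predicted = "\u0643\u064E\u062A\u064F\u0628\u064E"  # damma predicted on the second letter
    print(diacritic_error_rate(gold, predicted))        # one of three letters is wrong
```

Under this definition, a DER of 1.37% would correspond to roughly 14 letters with wrong diacritics per 1,000 scored letters.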