Abstract

Diacritics are essential for accurately representing the meaning and pronunciation of Arabic words and sentences, and considerable research effort has gone into improving automatic diacritization systems. This paper introduces a novel approach to Arabic diacritization based on Bidirectional Encoder Representations from Transformers (BERT) models. The approach is evaluated on two publicly available datasets, the Arabic Diacritization (AD) dataset and the Tashkeela Processed (TP) dataset, using several error metrics, including Diacritic Error Rate (DER) and Word Error Rate (WER). The results show that BERT outperforms all models used in other diacritization systems. On the AD dataset, the proposed system achieves state-of-the-art (SOTA) syntactic DER and WER of 1.14% and 3.34%, respectively. For morphological diacritization, the best results are a DER of 0.92% and a WER of 1.91%. These results correspond to a relative error reduction of over 30% compared with previous research. On the TP dataset, the BERT models reduce the benchmark DER from 4.0% to 1.11%. This study also introduces SUKOUN, a real-time diacritization system that serves diacritized text through a user-friendly website. A comparison with existing automatic diacritization tools on six example texts shows that SUKOUN predicts diacritics more accurately and better preserves the input format.
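To make the two metrics concrete, the following is a minimal sketch of how DER and WER might be computed for a reference and a predicted diacritization of the same text. It is an illustration under stated assumptions, not the paper's evaluation script: the abstract does not specify which characters are counted or how the syntactic and morphological variants are separated, so the sketch counts every non-diacritic character and treats the marks U+064B to U+0652 as diacritics. The function names `split_diacritics`, `der_wer`, and `relative_reduction` are hypothetical.

```python
# Assumption: the diacritic set is the Unicode range U+064B..U+0652
# (fathatan through sukun). The paper may use a different set.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_char, marks) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base character
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
            # a mark with no preceding base character is ignored
        else:
            pairs.append((ch, ""))
    return pairs


def der_wer(reference, prediction):
    """Return (DER, WER) as fractions in [0, 1].

    DER: share of base characters whose diacritics differ.
    WER: share of words containing at least one such character.
    """
    ref_words = reference.split()
    pred_words = prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction differ in word count")

    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base letters differ between the two texts")
        # Compare marks as sorted sequences so that, for example,
        # shadda+fatha matches fatha+shadda.
        wrong = sum(
            sorted(ref_marks) != sorted(pred_marks)
            for (_, ref_marks), (_, pred_marks) in zip(ref_pairs, pred_pairs)
        )
        char_total += len(ref_pairs)
        char_errors += wrong
        if wrong:
            word_errors += 1

    if char_total == 0:
        raise ValueError("reference contains no characters to score")
    return char_errors / char_total, word_errors / len(ref_words)


def relative_reduction(baseline, new):
    """Relative error reduction, e.g. from a baseline DER to a new DER."""
    return (baseline - new) / baseline
```

Under the same arithmetic, the TP result reported above, a DER drop from 4.0% to 1.11%, corresponds to `relative_reduction(4.0, 1.11)`, which is roughly a 72% relative reduction.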
