Abstract
Diacritics are essential for accurately representing the meaning and pronunciation of Arabic words and sentences, and researchers have long worked to improve automated diacritization systems. This paper introduces a novel approach to Arabic diacritization based on Bidirectional Encoder Representations from Transformers (BERT) models. The approach is evaluated on two publicly available datasets, the Arabic Diacritization (AD) dataset and the Tashkeela Processed (TP) dataset, using error metrics including Diacritic Error Rate (DER) and Word Error Rate (WER). The findings show that BERT surpasses all models employed in other diacritization systems. On the AD dataset, the proposed system achieved state-of-the-art (SOTA) syntactic DER and WER of 1.14% and 3.34%, respectively. For morphological diacritization, the best results were a DER of 0.92% and a WER of 1.91%. These outcomes correspond to a relative error reduction of over 30% compared to previous research. On the TP dataset, the BERT models reduced the benchmark DER from 4.0% to 1.11%. This study also introduces SUKOUN, a real-time diacritization system that serves diacritized text through a user-friendly website. A comparison with existing automatic diacritization tools on six example texts shows that SUKOUN offers higher prediction accuracy and better preserves the input format.
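For readers unfamiliar with the two metrics, the following is a minimal sketch of how DER and WER are commonly computed, together with the relative error reduction used to compare against a benchmark. It is not the paper's evaluation script: the abstract does not state the exact counting protocol, so the choices here are assumptions (every base character is counted, including those without a diacritic; a word is wrong if any of its characters carries a wrong diacritic; marks in the range U+064B to U+0652 are treated as diacritics). The function names `split_diacritics`, `der_wer`, and `relative_reduction` are hypothetical.

```python
# Assumption: Arabic diacritics are the marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun).
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Return a list of (base_char, sorted_marks) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    # Sort the marks so that shadda+fatha equals fatha+shadda.
    return [(base, "".join(sorted(marks))) for base, marks in pairs]


def der_wer(reference, prediction):
    """Compute (DER, WER) as fractions for two diacritized strings.

    Assumes both strings contain the same words and base characters
    and differ only in their diacritics.
    """
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("reference and prediction differ in word count")

    char_total = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base characters differ within a word")
        errors = sum(
            1 for (_, rm), (_, pm) in zip(ref_pairs, pred_pairs) if rm != pm
        )
        char_total += len(ref_pairs)
        char_errors += errors
        word_errors += 1 if errors else 0

    if char_total == 0:
        raise ValueError("empty input")
    return char_errors / char_total, word_errors / len(ref_words)


def relative_reduction(baseline, new):
    """Relative error reduction of `new` with respect to `baseline`."""
    return (baseline - new) / baseline


if __name__ == "__main__":
    # Two words, eight base characters; only the final diacritic differs.
    ref = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
    hyp = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
    der, wer = der_wer(ref, hyp)
    print(f"DER = {der:.3f}, WER = {wer:.3f}")
    # TP dataset figures quoted in the abstract: 4.0% down to 1.11%.
    print(f"relative reduction = {relative_reduction(4.0, 1.11):.2%}")
```

Under these assumptions, the TP figures quoted above (4.0% down to 1.11%) correspond to a relative DER reduction of roughly 72%. Published evaluations often differ in details such as whether case endings or undiacritized characters are counted, so numbers produced by this sketch are not directly comparable to those reported in the paper.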