Abstract

The rise of social media has contributed to the widespread use of the Arabizi writing form, which is used primarily in colloquial communication. Processing Arabizi text remains challenging for Natural Language Processing (NLP) tools because suitable language resources are scarce. In addition, there are no standardized rules for the transliteration mapping between Arabizi and Arabic, so the mapping varies across dialectal groups. To address these limitations for Moroccan Darija (MD), this work proposes a method for converting Arabizi to Arabic at the word level. The method applies rule-based transliteration followed by a weighted Levenshtein algorithm. The contributions of this approach are: (i) building a large MD dataset whose texts reflect the characteristics of MD and the colloquial forms typical of Arabizi writing; (ii) generating transliteration rules tailored to MD; and (iii) adapting the edit costs of the weighted Levenshtein algorithm to improve conversion performance. The approach was evaluated on three datasets: two state-of-the-art Darija-Modern Standard Arabic (MSA) datasets and the MD dataset collected as part of this work. The proposed method achieved a Mean Reciprocal Rank (MRR) of 92.14% and an accuracy of 88.44%.
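To make the two-stage idea concrete, the following is a minimal sketch of how rule-based transliteration, weighted Levenshtein ranking, and the two reported metrics could fit together. It is not the authors' implementation: the rule table `RULES`, the substitution-cost table, the toy lexicon, and all function names are hypothetical and chosen only for illustration.

```python
# Illustrative Arabizi-to-Arabic rules (an assumption, NOT the paper's rule set).
RULES = {
    "kh": "خ", "ch": "ش", "sh": "ش", "gh": "غ",
    "3": "ع", "7": "ح", "9": "ق",
    "a": "ا", "b": "ب", "d": "د", "k": "ك", "l": "ل", "m": "م",
    "n": "ن", "r": "ر", "s": "س", "t": "ت", "z": "ز",
    "w": "و", "o": "و", "u": "و", "y": "ي", "i": "ي",
}


def transliterate(word):
    """Stage 1: greedy rule-based mapping, trying two-character rules first."""
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RULES:
            out.append(RULES[pair])
            i += 2
        else:
            # Characters without a rule are copied unchanged.
            out.append(RULES.get(word[i], word[i]))
            i += 1
    return "".join(out)


def weighted_levenshtein(a, b, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Stage 2: edit distance where substitutions may have custom costs.

    sub_cost maps (source_char, target_char) to a cost; unlisted pairs cost 1.0.
    """
    sub_cost = sub_cost or {}
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * del_cost]
        for j, cb in enumerate(b, 1):
            s = 0.0 if ca == cb else sub_cost.get((ca, cb), 1.0)
            cur.append(min(prev[j] + del_cost,      # delete ca
                           cur[j - 1] + ins_cost,   # insert cb
                           prev[j - 1] + s))        # substitute or match
        prev = cur
    return prev[-1]


def rank_candidates(arabizi_word, lexicon, sub_cost=None):
    """Rank lexicon entries by weighted distance to the rough transliteration."""
    rough = transliterate(arabizi_word.lower())
    return sorted(lexicon, key=lambda w: weighted_levenshtein(rough, w, sub_cost))


def mrr_and_accuracy(rankings, gold):
    """Mean Reciprocal Rank and top-1 accuracy over parallel lists."""
    rr = [1.0 / (r.index(g) + 1) if g in r else 0.0 for r, g in zip(rankings, gold)]
    hits = sum(1 for r, g in zip(rankings, gold) if r and r[0] == g)
    return sum(rr) / len(gold), hits / len(gold)


# Usage: Latin "t" can stand for either ت or ط, so that substitution is made cheap.
COSTS = {("ت", "ط"): 0.3}      # hypothetical cost, for illustration only
LEXICON = ["حبيب", "طبيب"]     # toy lexicon, for illustration only
print(rank_candidates("tbib", LEXICON, COSTS))
```

In this toy example the rough transliteration of "tbib" is one substitution away from both lexicon entries, so uniform costs cannot separate them; lowering the cost of the ت/ط substitution moves the intended word to the top. This is the kind of effect that adapting edit costs is meant to achieve, though the actual costs used in the work are not given here.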
