Abstract

The evolution of information and communication technology has markedly influenced communication between correspondents. This evolution has facilitated the transmission of information and has engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, also called Arabizi . Moreover, the language used in social media and SMS messaging is characterized by the use of informal and non-standard vocabulary, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Since the Tunisian dialect suffers from the unavailability of basic tools and linguistic resources compared to Modern Standard Arabic, we resort to the use of these written sources as a starting point to build large corpora automatically. In the context of natural language processing and to benefit from these networks’ data, transliterating from Arabizi to Arabic script is a necessary step because most recently available tools for processing the Tunisian dialect expect Arabic script input. Indeed, the transliteration task can help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting the Tunisian dialect text that is written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. Then, we propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect, namely a rule-based model and a discriminative model as a sequence classification task based on conditional random fields). In the first model, we use a set of transliteration rules to convert the Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed both at word and character levels. In the end, our models got a character error rate of 10.47%.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call