Romanized Tunisian dialect transliteration using sequence labelling techniques

Jihene Younes,Hadhemi Achour,Emna Souissi,Ahmed Ferchichi

doi:10.1016/j.jksuci.2020.03.008

Jihene Younes, Hadhemi Achour + Show 2 more

Open Access

https://doi.org/10.1016/j.jksuci.2020.03.008

Copy DOI

Abstract

In recent years, social web users in Arabic countries have been resorting to the dialects as a written language in their social exchanges. Arabic dialects derive from modern standard Arabic (MSA) and differ significantly from one country to another and one region to another. The use of these dialects has led to an increase of interest in the specificities of such informal languages and their automatic processing within the NLP community. In this work, we deal with the Tunisian dialect (TD) in particular. We address the issue of the automatic Latin to Arabic transliteration of TD language productions on the social web and propose an approach that models the transliteration as a sequence labeling task. At a word level, several techniques, based on machine and deep learning, have been tested for this study, using real word messages extracted from social networks. We experiment and compare three transliteration models: A Conditional Random Fields-based model (CRF), a Bidirectional Long Short-Term Memory based model (BLSTM), and a BLSTM based model with CRF decoding (BLSTM-CRF). The obtained results show that BLSTM-CRF, leads to the best performance, reaching 96.78% of correctly transliterated words. We also evaluate the BLSTM-CRF transliteration approach in context on a set of random TD messages extracted from the social web. We obtained a total error rate of 2.7%. 25% of which are context errors.

Full Text