Building Bi-script Language Resources for the Tunisian Dialect’s NLP

Jihene Younes, Hadhemi Achour, Emna Souissi, Ahmed Ferchichi

doi:10.1016/j.procs.2021.05.101

Abstract

Language resources such as corpora, lexicons and dictionaries are key to automatically processing any natural language. In this paper, we focus on the written Tunisian dialect (TD), which is abundantly present on social media yet is still considered a low-resource language. We automatically construct sizable bi-script TD language resources from the social web using two deep learning-based NLP components, one for TD identification and one for TD transliteration. We target the Latin transcription of the language and use it to generate TD language resources written in the Arabic script. This work resulted in a Romanized TD corpus of 284,894 TD messages extracted from YouTube, an Arabic TD corpus generated by automatic Latin-to-Arabic transliteration of the extracted Romanized TD corpus, and two bi-script TD dictionaries (Latin-to-Arabic and Arabic-to-Latin) of 293,570 and 155,954 entries, respectively. The creation process and the constructed TD resources are described and evaluated.
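
As a rough illustration of how two bi-script dictionaries could be derived from aligned Latin/Arabic word pairs, the sketch below indexes the same pairs in both directions. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `build_bi_script_dictionaries` and the toy word pairs are hypothetical and are not taken from the corpus. It also shows one possible reason the two directions can differ in size, namely that several Romanized spellings may map to a single Arabic form; the abstract does not state that this is the cause.

```python
from collections import defaultdict


def build_bi_script_dictionaries(pairs):
    """Index aligned (latin, arabic) word pairs in both directions.

    Hypothetical helper for illustration; each key maps to the set of
    forms it was paired with in the other script.
    """
    latin_to_arabic = defaultdict(set)
    arabic_to_latin = defaultdict(set)
    for latin, arabic in pairs:
        latin_to_arabic[latin].add(arabic)
        arabic_to_latin[arabic].add(latin)
    return dict(latin_to_arabic), dict(arabic_to_latin)


# Toy, hand-written pairs (assumption: illustrative only, not corpus data).
toy_pairs = [
    ("barcha", "برشا"),
    ("barsha", "برشا"),
    ("behi", "باهي"),
    ("bahi", "باهي"),
    ("behy", "باهي"),
]

latin_to_arabic, arabic_to_latin = build_bi_script_dictionaries(toy_pairs)

# Five distinct Latin spellings collapse onto two Arabic forms.
print(len(latin_to_arabic), len(arabic_to_latin))
```

With these toy pairs the Latin-keyed dictionary has more entries than the Arabic-keyed one, because the spelling variants share an Arabic form.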

Full Text