On the Role of Orthographic Variations in Building Multidialectal Arabic Word Embeddings

Abdellah El Mekki, Ahmed Khoumsi, Ismail Berrada, Abdelkader El Mahdaouy

doi:10.21428/594757db.5febef29

Abstract

Dialectal Arabic (DA) is used by over 400 million people across Arab countries for communication on social media platforms, in web forums, and in daily life. Building Natural Language Processing systems for each DA variant is challenging because data are scarce and the available corpora are noisy. In this paper, we propose a method that incorporates orthographic features into word embedding mapping methods in order to induce a multidialectal embedding space. The method applies to both supervised and unsupervised cross-lingual embedding mapping approaches. Its core idea is to project the orthographic features into a shared vector space using Canonical Correlation Analysis (CCA); it then extends the word embedding vectors with the resulting features and learns the multidialectal mapping. Our results show that the proposed method improves Bilingual Lexicon Induction for DA by 3.33% and 17.50% over state-of-the-art supervised and unsupervised cross-lingual alignment methods, respectively.
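As a rough illustration of the three steps named in the abstract (CCA projection of orthographic features, extension of the embedding vectors, and learning the mapping), the following is a minimal sketch and not the authors' implementation. Several choices are assumptions made only for illustration: character bigram counts stand in for the orthographic features, a regularized whitening-plus-SVD routine stands in for CCA, orthogonal Procrustes stands in for the supervised mapping step, and the word lists, random embeddings, and the `scale` weight are toy placeholders.

```python
import numpy as np

def ngrams(word, n=2):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_counts(words, vocab, n=2):
    """Bag-of-n-grams matrix: one row per word, one column per n-gram."""
    index = {g: i for i, g in enumerate(vocab)}
    feats = np.zeros((len(words), len(vocab)))
    for row, word in enumerate(words):
        for g in ngrams(word, n):
            if g in index:
                feats[row, index[g]] += 1.0
    return feats

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_projections(X, Y, k, reg=1e-3):
    """Regularized CCA: returns projections A, B and the feature means."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt.T[:, :k], mx, my

def procrustes(X, Y):
    """Orthogonal W minimizing the Frobenius norm of X @ W - Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy seed dictionary (placeholder transliterated strings, not real data).
src_words = ["ktab", "bzaf", "dar", "mdrsa", "khobz", "bab"]
tgt_words = ["kitab", "bezzaf", "dar", "madrasa", "khubz", "bab"]

# Random vectors stand in for pretrained dialect-specific embeddings.
rng = np.random.default_rng(0)
src_emb = rng.normal(size=(len(src_words), 8))
tgt_emb = rng.normal(size=(len(tgt_words), 8))

# Step 1: orthographic features projected into a shared space with CCA.
vocab = sorted({g for w in src_words + tgt_words for g in ngrams(w)})
src_feats = ngram_counts(src_words, vocab)
tgt_feats = ngram_counts(tgt_words, vocab)
A, B, mx, my = cca_projections(src_feats, tgt_feats, k=3)

# Step 2: extend each embedding with its projected orthographic features.
scale = 1.0  # hypothetical weight on the orthographic part
src_ext = np.hstack([src_emb, scale * (src_feats - mx) @ A])
tgt_ext = np.hstack([tgt_emb, scale * (tgt_feats - my) @ B])

# Step 3: learn the mapping between the extended spaces.
W = procrustes(src_ext, tgt_ext)
mapped_src = src_ext @ W
```

In this sketch the mapping is kept orthogonal so that distances within the source space are preserved; an unsupervised variant would replace the seed dictionary with an induced one, which is not shown here.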

Full Text