Abstract

In this paper, we build several word embedding representations for code-mixed Moroccan Darija-French-English text. We begin by collecting a large corpus of social media posts and song lyrics. We then experiment with four algorithms: Word2Vec in both its Continuous Bag of Words (CBOW) and SkipGram flavors, FastText-CBOW, and FastText-SkipGram. Finally, we evaluate these models on two tasks: word analogies and word-level language detection. The use of character n-grams allowed the FastText models to perform much better than their Word2Vec counterparts on the word analogy task, while Word2Vec-SkipGram performed best on the language detection task. To the best of our knowledge, our embeddings are the first set of distributed representations for Arabizi Moroccan Darija text. In the course of this work, we also built the first code-switched English-French-Moroccan Darija word analogy dataset and the largest Arabizi code-switched English-French-Moroccan Darija.
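
The role of character n-grams mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names `char_ngrams` and `ngram_overlap` are hypothetical, the n-gram range of 3 to 6 is the FastText library default rather than a setting reported here, and the example spellings are chosen only for illustration. The sketch also omits the hashing of n-grams into buckets that FastText performs. It shows how two spelling variants of one word, which are common in Arabizi text, still share many subword units.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, FastText style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams inside the word.
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b, min_n=3, max_n=6):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a = char_ngrams(a, min_n, max_n)
    grams_b = char_ngrams(b, min_n, max_n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    # Illustrative (assumed) spelling variants of the same word.
    print(sorted(char_ngrams("bzaf")))
    print(ngram_overlap("bzaf", "bzzaf"))
```

A word-level model such as Word2Vec treats the two spellings as unrelated vocabulary entries, whereas a subword model builds each word vector from its n-gram vectors, so variants with shared n-grams start from related representations.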
