Abstract

Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine learning and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods, most of which improved the final model, such as word embeddings, TF-IDF, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.

Highlights

  • In recent years, an extensive increase in the use of social media platforms, such as Facebook and Twitter, has led to exponential growth in user-generated content.

  • We describe our work exploring different machine learning and deep learning methods to build a classifier for user dialect identification as part of MADAR (Multi-Arabic Dialect Applications and Resources) shared subtask-2 (Bouamor et al., 2018; Bouamor et al., 2019).

  • The task of user dialect identification can be seen as a text classification problem, where we predict the probability of a dialect given a sequence of words and other features provided by the task organizers.


Summary

Introduction

An extensive increase in the use of social media platforms, such as Facebook and Twitter, has led to exponential growth in user-generated content. This data is diverse: it comprises different expressions, languages, and dialects, which has attracted researchers to understand and harness language semantics through sentiment analysis, emotion detection, dialect identification, and many other Natural Language Processing (NLP) tasks. We tackle the problem of predicting a user's dialect from a set of their tweets. The task of user dialect identification can be seen as a text classification problem, where we predict the probability of a dialect given a sequence of words and other features provided by the task organizers. Besides reporting the results from different models, we show that the dataset provided for the task is not straightforward and requires additional analysis, feature engineering, and post-processing techniques.
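To make the step from tweet-level predictions to a user-level label concrete, here is a minimal sketch of a plain majority vote over each user's tweets. It is an illustration only: the work itself describes a relatively complex voting mechanism whose details are not given in this summary, so the function name `vote_user_dialects`, the `(user_id, dialect)` input format, and the label strings below are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def vote_user_dialects(tweet_predictions):
    """Aggregate per-tweet dialect predictions into one label per user.

    tweet_predictions: iterable of (user_id, predicted_dialect) pairs,
    one pair per tweet (hypothetical format, not the task's file format).
    Returns a dict mapping each user_id to its most frequent dialect.
    """
    votes = defaultdict(Counter)
    for user_id, dialect in tweet_predictions:
        votes[user_id][dialect] += 1
    # most_common(1) yields the label with the highest count for that user.
    return {user: counts.most_common(1)[0][0] for user, counts in votes.items()}


# Illustrative tweet-level predictions from some upstream classifier.
predictions = [
    ("user_a", "Egypt"),
    ("user_a", "Egypt"),
    ("user_a", "Sudan"),
    ("user_b", "Iraq"),
    ("user_b", "Kuwait"),
    ("user_b", "Iraq"),
]
user_labels = vote_user_dialects(predictions)
```

A simple majority vote treats every tweet equally; a more elaborate scheme could instead weight each vote by the classifier's confidence for that tweet.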

