Abstract

With the proliferation of social media and Internet access, a massive amount of data is being produced. Much of the textual data available on the web comes from people expressing their views in informal language. Arabic is one of the hardest Semitic languages to process because of its complex morphology. This paper contributes a new Arabic resource: a large Moroccan dataset retrieved from Twitter and carefully annotated by native speakers. To the best of our knowledge, it is the largest Moroccan dataset for sentiment analysis. It is distinguished by its size, by its quality, which rests on the commitment of the annotators, and by its accessibility to the research community. Furthermore, the MSTD (Moroccan Sentiment Twitter Dataset) is benchmarked through experiments on 4-way classification as well as polarity classification (positive, negative). Various machine-learning algorithms are combined with feature extraction techniques to find the best-performing settings. This work also presents the effect of stemming and lemmatization on the obtained accuracies.

Highlights

  • Natural language processing (NLP) is a very active area of research that exploits the most advanced algorithms and techniques to give machines the ability to understand human language

  • The Moroccan dialect, widely known as Darija, is a variety of Arabic; it is used in daily communication by Moroccan citizens, media programs, brand pages on social media, and commercial or government advertising to reach the general public

  • The following sections present datasets and corpora produced by different research communities for sentiment analysis, distinguishing three kinds of work: resources for Modern Standard Arabic (MSA), datasets in vernacular Arabic, and resources built on the Maghreb dialects, with an emphasis on research on the Tunisian, Algerian, and Moroccan colloquial varieties


Summary

INTRODUCTION

Natural language processing (NLP) is a very active area of research that exploits the most advanced algorithms and techniques to give machines the ability to understand human language. Its study has become indispensable for many businesses that want to analyze public opinion on the internet. It would be almost impossible for businesses to grow without being able to monitor their presence and brand image through customer interactions. Social media often implies colloquial forms of expression: users of these platforms create a virtual network of friends with whom they communicate in Dialectal Arabic. Communicating in the dialect has more impact and reaches more people.

Moroccan Dialect Challenges
Twitter Challenges
Annotation Challenges
RELATED WORK
MSA Datasets
Vernacular Arabic Datasets
Maghrebian Datasets
DATASET
Data Collection and Annotation
Dataset Properties The overall collected size of the dataset is approximately
Objective negative positive sarcasm
Data Stemming and Lemmatization
EXPERIMENTATION AND RESULTS
Feature Extraction
Classification Algorithms
Stemming Techniques
Discussion
CONCLUSION AND FUTURE WORK
