MSTD: Moroccan Sentiment Twitter Dataset

Soukaina Mihi, Sara Arezki, Nabil Laachfoubi, Ismail El, Brahim Ait

doi:10.14569/ijacsa.2020.0111045

Open Access

https://doi.org/10.14569/ijacsa.2020.0111045

Abstract

With the proliferation of social media and Internet accessibility, a massive amount of data has been produced. In most cases, the textual data available on the web comes from people expressing their views in informal language. Arabic is one of the hardest Semitic languages to process because of its complex morphology. This paper contributes a new Arabic resource: a large Moroccan dataset retrieved from Twitter and carefully annotated by native speakers. To the best of our knowledge, this dataset is the largest Moroccan dataset for sentiment analysis. It is distinguished by its size, its quality, ensured by the commitment of the annotators, and its accessibility to the research community. Furthermore, the MSTD (Moroccan Sentiment Twitter Dataset) is benchmarked through experiments on 4-way classification as well as polarity classification (positive, negative). Various machine-learning algorithms are combined with feature extraction techniques to reach optimal settings. This work also presents the effect of stemming and lemmatization on the accuracies obtained.
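The benchmarking setup described in the abstract, feature extraction followed by a classifier, can be illustrated with a minimal sketch. It is not the authors' pipeline: bag-of-words counts and a multinomial Naive Bayes classifier are stand-ins chosen for brevity, the `tokenize` helper and `NaiveBayes` class are hypothetical names, and the English training sentences are invented placeholders rather than MSTD tweets. Only the four label names follow the classes listed for the dataset (objective, negative, positive, sarcasm).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: a real Darija pipeline
    would also normalize script, stem, or lemmatize)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokenize(text):
                if token not in self.vocab:
                    continue  # ignore tokens never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented placeholder examples, one per class (not taken from MSTD).
train_texts = [
    "great match tonight love it",
    "terrible service very bad",
    "the meeting starts at noon",
    "oh great another delay just perfect",
]
train_labels = ["positive", "negative", "objective", "sarcasm"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("very bad service"))
print(model.predict("love it"))
```

For polarity classification, the same sketch applies with the training set restricted to the positive and negative labels.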

Highlights

Natural language processing (NLP) is a very active area of research that exploits the most advanced algorithms and techniques to give machines the ability to understand human language.
The Moroccan dialect, widely known as Darija, is a variety of Arabic; it is used in daily communication by Moroccan citizens, media programs, brand pages on social media, and commercial or government advertising to reach the general public.
The following sections present datasets and corpora produced by different research communities for sentiment analysis, distinguishing three kinds of work: resources for Modern Standard Arabic (MSA), datasets in Vernacular Arabic, and resources built on the Maghreb dialects, with an emphasis on research on the Tunisian, Algerian, and Moroccan colloquial languages.

Summary

INTRODUCTION

Natural language processing (NLP) is a very active area of research that exploits the most advanced algorithms and techniques to give machines the ability to understand human language. Its study has become indispensable for many businesses wanting to analyze public opinion on the internet. It would be almost impossible for businesses to grow without being able to monitor their presence and brand image through customer interactions. Social media often implies colloquial forms of expression: users of these platforms create a virtual network of friends with whom they communicate in Dialectal Arabic. This generates more impact and reaches more people.

Moroccan Dialect Challenges

Twitter Challenges

Annotation Challenges

RELATED WORK

MSA Datasets

Vernacular Arabic Datasets

Maghrebian Datasets

DATASET

Data Collection and Annotation

Dataset Properties The overall collected size of the dataset is approximately

Objective negative positive sarcasm

Data Stemming and Lemmatization

EXPERIMENTATION AND RESULTS

Feature Extraction

Classification Algorithms

Stemming Techniques

Discussion

CONCLUSION AND FUTURE WORK

Journal: International Journal of Advanced Computer Science and Applications
Publication Date: Jan 1, 2020
Citations: 7
License type: cc-by
