Abstract
With the proliferation of social media and Internet accessibility, a massive amount of data has been produced. In most cases, the textual data available on the web comes from people expressing their views in informal language. The Arabic language is one of the hardest Semitic languages to deal with because of its complex morphology. In this paper, a new contribution to Arabic resources is presented: a large Moroccan dataset retrieved from Twitter and carefully annotated by native speakers. To the best of our knowledge, this dataset is the largest Moroccan dataset for sentiment analysis. It is distinguished by its size, its quality, ensured by the commitment of the annotators, and its accessibility to the research community. Furthermore, the MSTD (Moroccan Sentiment Twitter Dataset) is benchmarked through experiments carried out for 4-way classification as well as polarity classification (positive, negative). Various machine-learning algorithms are combined with feature extraction techniques to find the best-performing settings. This work also presents the effect of stemming and lemmatization on the accuracies obtained.
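The benchmarking idea in the abstract, pairing each feature extraction technique with each classifier and comparing accuracies, can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the two feature extractors (unigrams, unigrams plus bigrams), the two classifiers (a majority-class baseline and a multinomial Naive Bayes), and the toy English examples are all assumptions chosen for brevity and do not come from the MSTD.

```python
import math
from collections import Counter, defaultdict


def unigrams(text):
    """Whitespace-tokenized unigram features."""
    return text.split()


def uni_bigrams(text):
    """Unigrams plus adjacent-token bigrams."""
    toks = text.split()
    return toks + [a + "_" + b for a, b in zip(toks, toks[1:])]


class MajorityBaseline:
    """Always predicts the most frequent training label."""

    def fit(self, docs, labels):
        self.label = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, feats):
        return self.label


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(docs, labels):
            self.token_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:
                score += math.log((self.token_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, extract, train, test):
    """Fit on train, return the fraction of correct predictions on test."""
    model.fit([extract(t) for t, _ in train], [y for _, y in train])
    hits = sum(model.predict(extract(t)) == y for t, y in test)
    return hits / len(test)


def benchmark(train, test):
    """Score every (feature extractor, classifier) combination."""
    extractors = {"unigram": unigrams, "uni+bigram": uni_bigrams}
    models = {"majority": MajorityBaseline, "naive_bayes": NaiveBayes}
    return {
        (e_name, m_name): accuracy(cls(), fn, train, test)
        for e_name, fn in extractors.items()
        for m_name, cls in models.items()
    }


# Hypothetical toy data (placeholder English tokens, NOT taken from the MSTD).
train = [
    ("good great", "positive"),
    ("great nice", "positive"),
    ("bad awful", "negative"),
    ("awful poor", "negative"),
    ("bad poor", "negative"),
]
test = [("good nice", "positive"), ("poor bad", "negative")]

results = benchmark(train, test)
for combo, acc in sorted(results.items()):
    print(combo, acc)
```

The same grid could be extended with a stemming or lemmatization step applied before feature extraction, which is how the effect of such preprocessing on accuracy could be compared across otherwise identical settings.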
Highlights
Natural language processing (NLP) is a very active area of research that exploits the most advanced algorithms and techniques to give machines the ability to understand human language.
The Moroccan dialect, widely known as Darija, is a variety of the Arabic language; it is used in daily communication by Moroccan citizens, media programs, brand pages on social media, and commercial or government advertising to reach the general public.
The following sections present various datasets and corpora produced by different research communities for sentiment analysis, distinguishing three kinds of work: resources for Modern Standard Arabic (MSA), datasets produced in Vernacular Arabic, and resources built on the Maghreb dialects, with an emphasis on research conducted on the Tunisian, Algerian, and Moroccan colloquial languages.
Summary
Natural language processing (NLP) is a very active area of research that exploits the most advanced algorithms and techniques to give machines the ability to understand human language. Its study has become indispensable for many businesses that want to analyze public opinion on the internet. It would be almost impossible for businesses to grow without being able to monitor their presence and brand image through customer interactions. Social media often implies colloquial forms of expression: users of these platforms create a virtual network of friends with whom they communicate in Dialectal Arabic. This generates more impact and reaches more people.