An Annotated Huge Dataset for Standard and Colloquial Arabic Reviews for Subjective Sentiment Analysis

Ashraf Elnagar, Leena Lulu, Omar Einea

doi:10.1016/j.procs.2018.10.474

Open Access

https://doi.org/10.1016/j.procs.2018.10.474

Journal: Procedia Computer Science
Publication Date: Jan 1, 2018
Citations: 40
License type: cc-by-nc-nd

Affiliation: University of Sharjah, Ajman University

Abstract

Sentiment analysis is increasingly popular because it gives an indication of wider public opinion or attitudes towards products, services, articles, and so on. Many researchers have shown considerable interest in this field, but most studies have focused on English and other Indo-European languages, and very few have addressed Arabic. This is mostly because large, free Arabic datasets that contain both Modern Standard Arabic (MSA) and Dialectal Arabic (DA) are rare or nonexistent. In general, one of the main challenges in developing robust sentiment analysis systems is the availability of such large-scale datasets. They exist in abundance for English, but not for a low-resource language such as Arabic. Recently, there have been some efforts to provide relatively large-scale Arabic datasets dedicated to sentiment analysis, such as LABR and, most recently, BRAD 1.0, which is considered the largest Arabic book reviews dataset for sentiment analysis and machine learning applications. In this work, we present BRAD 2.0, an extension of BRAD 1.0 with more than 200K extra records to account for several Arabic dialects. BRAD 2.0 contains 692,586 annotated records; each is a single review of a book together with the reviewer’s rating, from 1 to 5. Its most interesting property is that it combines both MSA and DA. To verify and validate the proposed dataset, we implement several state-of-the-art supervised and unsupervised classifiers to categorize book reviews. For the unsupervised classifiers, we implemented several CNN and RNN models utilizing GloVe-based word embeddings. All classifiers performed well, with the highest accuracies between 90% and 91%. Experimental results show that BRAD 2.0 is rich and robust.
Our key contribution is to make this benchmark dataset available and accessible to promote further research in the field of Arabic computational linguistics.
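
As an illustration of the record structure described in the abstract (one review text plus a 1 to 5 rating), the following is a minimal Python sketch of how such records might be turned into binary sentiment labels. The `Review` class, the function names, and the mapping itself (ratings 1 and 2 as negative, 4 and 5 as positive, 3 dropped as neutral) are assumptions made for illustration; the abstract does not state how the authors derive polarity from ratings.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Review:
    """Hypothetical record: one book review and its 1-5 rating."""
    text: str
    rating: int


def rating_to_label(rating: int) -> Optional[str]:
    """Map a 1-5 rating to a polarity label.

    Assumed convention (not taken from the paper):
    1-2 -> "negative", 4-5 -> "positive", 3 -> None (neutral, skipped).
    """
    if not 1 <= rating <= 5:
        raise ValueError(f"rating must be in 1..5, got {rating}")
    if rating <= 2:
        return "negative"
    if rating >= 4:
        return "positive"
    return None


def to_binary_examples(reviews: List[Review]) -> List[Tuple[str, int]]:
    """Build (text, label) pairs with 1 = positive, 0 = negative.

    Reviews whose rating maps to None (neutral) are left out.
    """
    examples = []
    for review in reviews:
        label = rating_to_label(review.rating)
        if label is not None:
            examples.append((review.text, 1 if label == "positive" else 0))
    return examples
```

Pairs produced this way could then be fed to any text classifier; dropping the neutral class is only one possible design choice, and a three-class setup would keep rating 3 as its own label.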

Similar Papers

Hotel Arabic-Reviews Dataset Construction for Sentiment Analysis Applications
Ashraf Elnagar ... Yasmin S Khalifa
18 Nov 2017

Attention Mechanism Architecture for Arabic Sentiment Analysis
Mohamed Berrimi ... Mohamed Saidi
ACM Transactions on Asian and Low-Resource Language Information Processing | VOL. 22
24 Mar 2023

Investigation on sentiment analysis for Arabic reviews
Ashraf Elnagar
01 Nov 2016

SAHAR-LSTM: An enhanced Model for Sentiment Analysis of Hotels’ Arabic Reviews based on LSTM
Manal Nejjari ... Abdelouafi Meziane
24 Nov 2020
