Abstract

While sentiment analysis research for languages such as English has moved on to advanced topics, other languages such as Arabic still face basic problems and challenges, most notably the lack of large corpora. Moreover, manual annotation is time-consuming and difficult when the corpus is large. This paper presents a semi-supervised self-learning technique that extends an Arabic sentiment-annotated corpus with unlabeled data. We use a neural network to train a set of models on a manually labeled dataset containing 15,000 tweets, and then use these models to extend that dataset into a large Arabic sentiment corpus called “AraSenCorpus”. AraSenCorpus contains 4.5 million tweets and covers both Modern Standard Arabic and some Arabic dialects. A long short-term memory (LSTM) deep learning classifier is trained and tested on the final corpus. We evaluate the proposed framework on two external benchmark datasets to verify that it improves Arabic sentiment classification. The experimental results show that classifiers trained on our corpus outperform existing state-of-the-art systems.
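The self-learning idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the paper trains neural network models on 15,000 labeled tweets, whereas the sketch below swaps in a tiny Naive Bayes-style word-count classifier and toy English tokens so that it runs with the standard library alone. The function names, the `threshold` and `max_rounds` parameters, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

LABELS = ("pos", "neg")

def train(labeled):
    """Fit per-class word counts (a stand-in for the paper's neural models)."""
    counts = {label: Counter() for label in LABELS}
    docs = Counter()
    for tokens, label in labeled:
        counts[label].update(tokens)
        docs[label] += 1
    return counts, docs

def predict_proba(model, tokens):
    """Return class probabilities using add-one smoothed Naive Bayes."""
    counts, docs = model
    vocab = set().union(*(counts[label] for label in LABELS))
    total_docs = sum(docs.values())
    log_scores = {}
    for label in LABELS:
        total = sum(counts[label].values())
        score = math.log(docs[label] / total_docs)
        for token in tokens:
            score += math.log((counts[label][token] + 1) / (total + len(vocab)))
        log_scores[label] = score
    # Normalise log scores into probabilities.
    top = max(log_scores.values())
    exps = {label: math.exp(s - top) for label, s in log_scores.items()}
    norm = sum(exps.values())
    return {label: value / norm for label, value in exps.items()}

def self_train(seed, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly pseudo-label confident examples and add them to the corpus.

    `threshold` and `max_rounds` are hypothetical knobs, not values from the paper.
    """
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train(labeled)
        confident, remaining = [], []
        for tokens in pool:
            probs = predict_proba(model, tokens)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                confident.append((tokens, label))
            else:
                remaining.append(tokens)
        if not confident:
            break  # nothing new was labeled confidently, so stop early
        labeled.extend(confident)
        pool = remaining
    return labeled, pool

# Toy data (hypothetical): a small hand-labeled seed and an unlabeled pool.
seed = [
    (["great", "happy"], "pos"),
    (["love", "great"], "pos"),
    (["bad", "sad"], "neg"),
    (["hate", "bad"], "neg"),
]
unlabeled = [["great", "love"], ["bad", "hate"], ["happy", "day"], ["sad", "day"]]

labeled, pool = self_train(seed, unlabeled)
print(len(labeled), "labeled examples;", len(pool), "left unlabeled")
```

The confidence threshold is the main design choice in a loop like this: a higher value adds fewer but cleaner pseudo-labels, while a lower value grows the corpus faster at the risk of reinforcing the model's own mistakes.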

Highlights

  • Several tasks in natural language processing require annotated corpora for training and evaluating methods and for comparing different systems [1]

  • AraSenCorpus was evaluated on two benchmark datasets to assess the effectiveness of the semi-supervised annotation

  • For two-way sentiment classification, using our corpus improves the results by 7% and 5% on the SemEval 2017 and Arabic Sentiment Tweets Dataset (ASTD) benchmark datasets, respectively


Summary

Introduction

Several tasks in natural language processing require annotated corpora for training and evaluating methods and for comparing different systems [1]. Manual annotation of corpora is usually costly and becomes prohibitive when scaled to larger datasets [2]. For popular natural language processing tasks such as sentiment analysis, there are widely used corpora that serve as baselines for proposed approaches and methods. Several corpora exist for Arabic sentiment analysis, but the high cost of manual annotation means these resources are either small or built through entirely automatic methods such as user ratings or Arabic sentiment lexicons. The data in these corpora are outdated, incomplete, or small.
