The number of Yemeni users on social media platforms has been growing quickly. Most research in Arabic sentiment analysis has focused on Modern Standard Arabic (MSA) and a few specific dialects, such as Egyptian, Levantine, and Gulf, which leaves a noticeable gap in sentiment analysis research for the Yemeni dialect. The reason is the lack of a reliable Yemeni lexicon and corpus, and of a real dataset for social media sentiment analysis. This research addresses that lack by presenting a Yemeni dialect sentiment lexicon and corpus. These resources are intended for researchers and practitioners who analyze sentiment in Yemeni dialect social media content, and they contribute to a better understanding of Yemeni public opinion, social media monitoring, marketing, cultural understanding, and efforts to respond to crises in Yemen. The Yemeni dialect sentiment lexicon contains a reasonable number of words and phrases categorized by their positive or negative sentiment tendency. In addition, we constructed a corpus of more than 54,000 comments collected from the Facebook platform. The unlabeled comments come from the main telecommunications companies in Yemen (Yemen Telecom, Yemen Mobile, YOU, and Sabafon) and were written by people commenting on public issues related to the services those companies provide. A lexicon-based approach is used to extract the sentiment polarity of each comment and label it as positive, negative, or neutral, which yields the labeled corpus. The evaluation metrics are accuracy, recall, precision, F-measure, and the confusion matrix. The accuracy of the lexicon-based labeling was calculated by comparing its output with labels assigned manually by three Yemeni experts. Under this evaluation, the lexicon-based approach achieved an accuracy of 90.05%.
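The labeling and evaluation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring rule or tokenization, so the sketch assumes whitespace tokenization and a simple count of positive minus negative lexicon hits. The lexicon entries, comments, and manual labels are hypothetical English placeholders, since the actual Yemeni dialect entries are not listed here.

```python
from collections import Counter

# Hypothetical placeholder lexicon (assumption). The real lexicon holds
# Yemeni dialect words and phrases, which the abstract does not list.
POSITIVE = {"good", "excellent", "fast"}
NEGATIVE = {"bad", "slow", "expensive"}


def label_comment(comment: str) -> str:
    """Label a comment by counting lexicon hits (assumed scoring rule)."""
    tokens = comment.lower().split()  # assumed whitespace tokenization
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def accuracy(predicted: list[str], manual: list[str]) -> float:
    """Fraction of comments where the lexicon label matches the manual label."""
    if not manual or len(predicted) != len(manual):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == m for p, m in zip(predicted, manual))
    return matches / len(manual)


def confusion_matrix(predicted: list[str], manual: list[str]) -> Counter:
    """Count (manual label, predicted label) pairs."""
    return Counter(zip(manual, predicted))


# Hypothetical comments and manual labels, for illustration only.
comments = [
    "the network is fast and good",
    "service is slow and expensive",
    "I called support yesterday",
    "good price but bad coverage",
]
manual = ["positive", "negative", "neutral", "negative"]

predicted = [label_comment(c) for c in comments]
print(accuracy(predicted, manual))  # 0.75
print(confusion_matrix(predicted, manual))
```

The last comment shows a known weakness of plain hit counting: one positive and one negative term cancel out, so the comment is labeled neutral even though the manual label is negative. Precision, recall, and F-measure per class can be derived from the same confusion counts.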