Abstract

Sentiment analysis is an important sub-task of Natural Language Processing that aims to determine the polarity of a review. Most of the work done on sentiment analysis is for the resource-rich languages of the world, but very limited work has been done on resource-poor languages. In this work, we focus on developing a Sentiment Analysis System for Roman Urdu, which is a resource-poor language. To this end, a dataset of 11,000 reviews has been gathered from six different domains. Comprehensive annotation guidelines were defined and the dataset was annotated using the multi-annotator methodology. Using the annotated dataset, state-of-the-art algorithms were used to build a sentiment analysis system. To improve the results of these algorithms, four different studies were carried out based on: word-level features, character level features, and feature union. The best results showed that we could reduce the error rate by 12% from the baseline (80.07%). Also, to see if the improvements are statistically significant, we applied t-test and Confidence Interval on the obtained results and found that the best results of each study are statistically significant from the baseline.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call