Role of Discourse Information in Urdu Sentiment Classification

Dr Muhammad Awais,Dr Muhammad Shoaib

doi:10.1145/3300050

Abstract

In computational linguistics, sentiment analysis refers to the classification of opinions in a positive class or a negative class. There exist a lot of different methods for sentiment analysis of the English language, but the literature lacks the availability of methods and techniques for Urdu, which is the largely spoken language in the South Asian sub-continent and the national language of Pakistan. The currently available techniques, such as adjective count method known as Bag of Words (BoW), is not sufficient for classification of complex sentiment written in the Urdu language. Also, the performance of available machine-learning techniques (with legacy features), for classification of Urdu sentiments, are not comparable with the achieved accuracy of other languages. In the case of the English language, the discourse information (sub-sentence-level information) boosts the performance of both the BoW method and machine-learning techniques, but there are very few works available that have tested the context-level information for the sentiment analysis of the Urdu language. This research aims to extract the discourse information from the Urdu sentiments and utilise the discourse information to improve the performance and reduce the error rate of existing techniques for Urdu Sentiment classification. The proposed solution extracts the discourse information, suggests a new set of features for machine-learning techniques, and introduces a set of rules to extend the capabilities of the BoW model. The results show that the task has been enhanced significantly and the performance metrics such as recall, precision, and accuracy are increased by 31.25%, 8.46%, and 21.6%, respectively. In future, the proposed technique can be extended to sentiments with more than two sub-opinions, such as for blogs, reviews, and TV talk shows.

Full Text