A New Feature Selection Method for Sentiment Analysis in Short Text

H M Keerthi Kumar,B S Harish

doi:10.1515/jisys-2018-0171

Abstract

Abstract In recent internet era, micro-blogging sites produce enormous amount of short textual information, which appears in the form of opinions or sentiments of users. Sentiment analysis is a challenging task in short text, due to use of formal language, misspellings, and shortened forms of words, which leads to high dimensionality and sparsity. In order to deal with these challenges, this paper proposes a novel, simple, and yet effective feature selection method, to select frequently distributed features related to each class. In this paper, the feature selection method is based on class-wise information, to identify the relevant feature related to each class. We evaluate the proposed feature selection method by comparing with existing feature selection methods like chi-square ( χ 2), entropy, information gain, and mutual information. The performances are evaluated using classification accuracy obtained from support vector machine, K nearest neighbors, and random forest classifiers on two publically available datasets viz., Stanford Twitter dataset and Ravikiran Janardhana dataset. In order to demonstrate the effectiveness of the proposed feature selection method, we conducted extensive experimentation by selecting different feature sets. The proposed feature selection method outperforms the existing feature selection methods in terms of classification accuracy on the Stanford Twitter dataset. Similarly, the proposed method performs competently equally in terms of classification accuracy compared to other feature selection methods in most of the feature subsets on Ravikiran Janardhana dataset.

Highlights

The popularity of micro-blog applications in the recent decade generates enormous amount of short textual information
Sentiment analysis is a challenging task in short text, due to use of formal language, misspellings, and shortened forms of words, which leads to high dimensionality and sparsity
There are many challenges that need to be addressed i.e. use of formal language, misspellings, and shortened form of words, which leads to high dimensionality and sparsity

Summary

Introduction

The popularity of micro-blog applications in the recent decade generates enormous amount of short textual information. Millions of users make use of micro-blog sites to express their opinion or sentiment related to a product, topic, or events which take place in day to day life [39]. An opinion may be regarded as statements in which the opinion holder makes specific claim about a topic using certain sentiment [8, 24]. Many marketing companies use micro-blog textual information to identify sentiments related to the product or an event [10, 58]. The detection and analysis of sentiments in short texts is an attractive topic, for many researchers and practitioners, to classify text into different polarities or classes

Methods

Findings

Discussion

Conclusion