Dimensionality reduction for sentiment analysis using pre-processing techniques

Mayuri Mhatre,Pranali Kadam,Dakshata Phondekar,Anushka Chawathe,Kranti Ghag

doi:10.1109/iccmc.2017.8282676

Abstract

Sentiment analysis is the study of people's opinions, sentiments, attitudes and emotions, expressed in written language but this process is time consuming, inconsistent and costly in business context. Pre-processing the data will help to ease this difficulty. Pre-processing is the process of cleaning and preparing the text for its analysis using pre-processing techniques. The existing pre-processing techniques are Handling Expressive Lengthening, Emoticons Handling, HTML Tags Removal, Punctuations Handling, Slangs Handling, Stopwords Removal, Stemming and Lemmatization. In this paper, the effect of various pre-processing techniques and their combinations was analyzed on the dataset taken from Kaggle called Bag of Words Meets Bags of Popcorn. By taking every possible combination of pre-processing techniques, the aim was to find the one giving highest accuracy. Random Forest Classifier was used to predict sentiments as it is known to give good accuracy and the result was evaluated using 10 fold cross validation method. Accuracy increased from unprocessed data to pre-processed data. It was concluded that using pre-processing techniques gives a higher accuracy than the traditional approach i.e. no pre-processing.

Full Text