Corpus-Based Hashing Count Frequency Vectorization of Sentiment Analysis of Movie Reviews

M Shyamala Devi, G Bhavisha, B Lohitha, Y Lakshmi Akshitha, M Anusha, R Aruna, G Chandana

doi:10.1007/978-981-19-2130-8_10

Abstract

Sentiment analysis is context-specific text mining that identifies and extracts qualitative information from various sources, helping businesses understand the social sentiment toward their model, product, brand, or service while monitoring online conversations. As the entertainment landscape advances, analyzing the real-time pulse of customers watching a movie is essential for rating the movie and for gauging its success in the marketing industry. Machine learning can be used to predict the sentiment of the movie reviews given by customers. A movie review dataset from the Kaggle repository with 117,211 movie reviews is used to analyze the type of movie based on the review comments. The dataset is preprocessed by removing stop words and unwanted tokens. Tokens are extracted from the text using the n-gram method. Emotional labels are substituted for the tokens, and the machine is trained to identify the sentiment emotions during the learning phase. The sentiment labels are converted into features to form the corpus text for predicting the emotions in the movie reviews. The corpus is split into training and testing datasets and fitted to Count Vectorizer, Hash Vectorizer, TF-IDF Transformer, and TF-IDF Vectorizer to extract the important features from the text. The extracted features are passed to all the classifiers to analyze the performance of the emotion prediction. The scripts are written in Python and run with Spyder in the Anaconda Navigator IDE, and the experimental results show that the decision tree classifier with Hash Vectorizer achieves 98.7% accuracy in predicting the sentiment of the movie reviews.

Keywords: Machine learning, Tokens, Corpus, Vectorizer, TF-IDF, Accuracy
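The difference between count-based and hash-based vectorization mentioned in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' code and not a library implementation: the stop-word list, the two sample reviews, the function names, the n-gram range, the choice of CRC32 as the hash, and `n_features=32` are all assumptions made for illustration.

```python
import re
import zlib
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "this", "it"}


def tokenize(text, ngram_range=(1, 2)):
    """Lowercase the text, drop stop words, and emit word n-grams."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    lo, hi = ngram_range
    tokens = []
    for n in range(lo, hi + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def count_vectorize(docs, ngram_range=(1, 2)):
    """Vocabulary-based counts: one column per distinct token in the corpus."""
    tokenized = [tokenize(d, ngram_range) for d in docs]
    distinct = sorted({tok for doc in tokenized for tok in doc})
    vocab = {tok: idx for idx, tok in enumerate(distinct)}
    rows = []
    for doc in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(doc).items():
            row[vocab[tok]] = count
        rows.append(row)
    return rows, vocab


def hash_vectorize(docs, n_features=32, ngram_range=(1, 2)):
    """Hashed counts: column = crc32(token) mod n_features, no vocabulary stored."""
    rows = []
    for d in docs:
        row = [0] * n_features
        for tok in tokenize(d, ngram_range):
            # CRC32 is an assumed stand-in hash; it is deterministic across runs.
            row[zlib.crc32(tok.encode("utf-8")) % n_features] += 1
        rows.append(row)
    return rows


# Made-up sample reviews, not taken from the dataset.
reviews = [
    "The movie was great and the acting was great",
    "This movie was a boring mess",
]
count_rows, vocab = count_vectorize(reviews)
hash_rows = hash_vectorize(reviews)
```

The count-based variant needs a pass over the corpus to build its vocabulary, so its width grows with the number of distinct tokens. The hashed variant has a fixed width and needs no fitting, at the cost of possible collisions when two tokens map to the same column. Either matrix could then be passed to a classifier such as a decision tree.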

Full Text