LSTM, VADER and TF-IDF based Hybrid Sentiment Analysis Model

Mohamed Chiny, Younes Chihab, Omar Bencharef, Marouane Chihab

doi:10.14569/ijacsa.2021.0120730

Open Access

https://doi.org/10.14569/ijacsa.2021.0120730

Abstract

Most sentiment analysis models based on supervised learning need large amounts of labeled data during training to give satisfactory results. Such data is usually expensive to obtain and leads to high labor costs in real-world applications. This work proposes a hybrid sentiment analysis model based on a Long Short-Term Memory (LSTM) network, a rule-based sentiment analysis lexicon (VADER) and the Term Frequency-Inverse Document Frequency (TF-IDF) weighting method. These three input models are combined in a binary classification model, for which each of the following algorithms has been implemented: Logistic Regression, k-Nearest Neighbors, Random Forest, Support Vector Machine and Naive Bayes. The model was then trained on a limited amount of data from the IMDB dataset. The evaluation on the IMDB data shows a significant improvement in accuracy and F1 score compared to the best scores recorded by the three input models separately. In addition, the proposed model was able to transfer the knowledge gained on the IMDB dataset to better handle new data from the Twitter US Airlines Sentiments dataset.
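
The combination step described in the abstract can be illustrated with a minimal sketch. It assumes each review has already been reduced to three numbers, one per input model (an LSTM probability, a lexicon compound score, and a TF-IDF model probability), and it trains a small logistic regression on top of them with plain gradient descent. The feature values, the names TRAIN_X, TRAIN_Y, train_meta_classifier and predict, and the hyperparameters are all hypothetical and chosen for illustration; the paper's actual features and training setup are not given in this excerpt.

```python
import math

# Hypothetical rows: (lstm_probability, lexicon_compound_score, tfidf_probability).
# These numbers are invented for illustration, not taken from the paper.
TRAIN_X = [
    (0.9, 0.8, 0.85), (0.7, 0.5, 0.6), (0.8, -0.1, 0.9), (0.6, 0.7, 0.4),   # positive
    (0.2, -0.6, 0.1), (0.3, -0.4, 0.35), (0.4, 0.2, 0.2), (0.1, -0.8, 0.3),  # negative
]
TRAIN_Y = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def score(weights, bias, x):
    """Probability of the positive class for one feature row."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_meta_classifier(features, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n_samples = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = score(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Binary decision: 1 = positive sentiment, 0 = negative sentiment."""
    return 1 if score(weights, bias, x) >= 0.5 else 0


weights, bias = train_meta_classifier(TRAIN_X, TRAIN_Y)
print(predict(weights, bias, (0.95, 0.9, 0.9)))    # all three inputs agree: positive
print(predict(weights, bias, (0.05, -0.9, 0.05)))  # all three inputs agree: negative
```

Logistic regression is only one of the five classifiers the abstract lists; the same three-feature rows could be fed to any of the others.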

Highlights

With the massive use of social networks such as Facebook, Twitter and Instagram, and of dedicated platforms for sharing reviews and comments such as IMDB and Airbnb, it has become extremely difficult to track down published information, let alone extract relevant information such as reviews about a product or service. This is due, on the one hand, to the abundance and variety of published data [1] and, on the other hand, to the unstructured nature of the published texts, which makes it almost impossible to analyze them with classical computing methods [2]. The content produced by the social media community is one of the richest sources of data in terms of opinions and knowledge, and offers greater opportunities for businesses, governments, and society to extract valuable, expressive, and diverse knowledge, both in terms of the content itself and context-related knowledge [3].
According to the results obtained, the proposed model performs better in terms of accuracy and F1 score, and can exceed the best of the three input models (LSTM, Valence Aware Dictionary and sEntiment Reasoner (VADER), and term frequency-inverse document frequency (TF-IDF)) by 5.91% in accuracy and 5.51% in F1 score.
The content created by users of social media and dedicated platforms is one of the richest sources of data in terms of opinions and knowledge.

Summary

Introduction

With the massive use of social networks such as Facebook, Twitter and Instagram, and of dedicated platforms for sharing reviews and comments such as IMDB and Airbnb, it has become extremely difficult to track down published information, let alone extract relevant information such as reviews about a product or service. This is due, on the one hand, to the abundance and variety of published data [1] and, on the other hand, to the unstructured nature of the published texts, which makes it almost impossible to analyze them with classical computing methods [2]. The content produced by the social media community is one of the richest sources of data in terms of opinions and knowledge, and offers greater opportunities for businesses, governments, and society to extract valuable, expressive, and diverse knowledge, both in terms of the content itself and context-related knowledge [3]. Sentiment analysis is a field of analysis that aims to determine the opinion and subjectivity of people's criticisms and attitudes towards entities and their attributes from unstructured written text [4]. As an example, based on the emotional attributes of words, Turney [5] used a simple unsupervised learning algorithm that computes pointwise mutual information to measure the sentiment polarity of sentences.
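
To make the pointwise mutual information (PMI) idea concrete, here is a small sketch that estimates PMI from raw co-occurrence counts and derives a polarity score as the difference between a phrase's association with a positive reference word and with a negative one. The reference-word scheme, the function names pmi and semantic_orientation, and all counts are assumptions made for illustration; the details of the method in [5] are not given in this excerpt.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts.

    count_xy: number of documents containing both x and y
    count_x, count_y: number of documents containing x, respectively y
    total: total number of documents
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        raise ValueError("this sketch needs strictly positive counts (no smoothing)")
    # Algebraically equal to p(x, y) / (p(x) * p(y)), with a single division.
    return math.log2(count_xy * total / (count_x * count_y))


def semantic_orientation(phrase_count, pos_count, neg_count, co_pos, co_neg, total):
    """Polarity score: association with a positive reference word minus
    association with a negative reference word. Above zero leans positive."""
    return (pmi(co_pos, phrase_count, pos_count, total)
            - pmi(co_neg, phrase_count, neg_count, total))


# Hypothetical counts over 1000 documents: the phrase appears in 100 of them,
# each reference word in 50; the phrase co-occurs 20 times with the positive
# reference word and 5 times with the negative one.
so = semantic_orientation(phrase_count=100, pos_count=50, neg_count=50,
                          co_pos=20, co_neg=5, total=1000)
print(so)
```

With these counts the phrase co-occurs with the positive reference word four times more often than independence would predict, and with the negative one exactly as often as independence would predict, so the score is positive.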

Objectives

Results

Discussion

Conclusion