Tweets Classification on the Base of Sentiments for US Airline Companies

Gyu Choi, Imran Ashraf, Furqan Rustam, Arif Mehmood, Saleem Ullah

doi:10.3390/e21111078

Abstract

The use of data from social networks such as Twitter has increased during the last few years to improve political campaigns, the quality of products and services, sentiment analysis, etc. Tweet classification based on user sentiments is a collaborative and important task for many organizations. This paper proposes a voting classifier (VC) to help sentiment analysis for such organizations. The VC is based on logistic regression (LR) and a stochastic gradient descent classifier (SGDC) and uses a soft voting mechanism to make the final prediction. Tweets were classified into positive, negative and neutral classes based on the sentiments they contain. In addition, a variety of machine learning classifiers were evaluated using accuracy, precision, recall and F1 score as the performance metrics. The impact of feature extraction techniques, including term frequency (TF), term frequency-inverse document frequency (TF-IDF), and word2vec, on classification accuracy was investigated as well. Moreover, the performance of a deep long short-term memory (LSTM) network was analyzed on the selected dataset. The results show that the proposed VC performs better than the other classifiers. The VC achieves an accuracy of 0.789 with TF and 0.791 with TF-IDF feature extraction. The results demonstrate that ensemble classifiers achieve higher accuracy than non-ensemble classifiers. Experiments further showed that machine learning classifiers perform better when TF-IDF is used as the feature extraction method. Word2vec feature extraction performs worse than TF and TF-IDF feature extraction. The LSTM achieves a lower accuracy than the machine learning classifiers.
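The soft voting mechanism mentioned in the abstract can be illustrated with a minimal sketch. It assumes each base classifier (here standing in for LR and SGDC) outputs a probability per class, and that the ensemble averages those probabilities with equal weights before picking the most probable class. The function name `soft_vote`, the equal weighting, and the probability values are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def soft_vote(probabilities, weights=None):
    """Average per-class probabilities from several classifiers.

    probabilities: list of dicts mapping class label -> probability,
                   one dict per base classifier.
    weights:       optional per-classifier weights (assumed equal if omitted).
    Returns the winning label and the averaged probabilities.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    labels = probabilities[0].keys()
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total
        for label in labels
    }
    # The final prediction is the class with the highest averaged probability.
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of the two base classifiers for a single tweet.
lr_probs = {"negative": 0.50, "neutral": 0.30, "positive": 0.20}
sgdc_probs = {"negative": 0.20, "neutral": 0.20, "positive": 0.60}

label, averaged = soft_vote([lr_probs, sgdc_probs])
print(label, averaged)
```

In this made-up case the two classifiers disagree on the top class, so a majority (hard) vote would tie, whereas soft voting resolves the disagreement in favor of the class with the larger combined probability.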

Highlights

Text mining is one of the distinguished fields of data mining which possesses the potential to extract useful information from raw data
The results demonstrate that calibrated classifier (CC), voting classifier (VC), logistic regression (LR), extra trees classifier (ETC), and random forest (RF) perform better with both term frequency (TF) and term frequency-inverse document frequency (TF-IDF) feature extraction methods
This paper proposes a voting classifier that is based on logistic regression and stochastic gradient descent classifier
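
To make the difference between the TF and TF-IDF features named in the highlights concrete, here is a small sketch on three made-up tweets. It assumes raw counts for TF and the plain `log(N / df)` form for IDF; the exact weighting variant and preprocessing used in the paper are not stated in this section, so the function `tf_idf` and the sample tweets are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight raw term counts by log(N / document frequency).

    documents: list of token lists. Returns one dict per document
    mapping term -> TF-IDF weight (assumed unsmoothed IDF variant).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)  # term frequency (TF) as raw counts
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weighted


# Hypothetical tweets, already lowercased and tokenized.
tweets = [
    "flight delayed again".split(),
    "great flight crew".split(),
    "lost my bag again".split(),
]

tf_vectors = [Counter(tokens) for tokens in tweets]
tfidf_vectors = tf_idf(tweets)
print(tf_vectors[0])
print(tfidf_vectors[0])
```

With TF alone, every word in the first tweet gets the same weight. TF-IDF lowers the weight of "flight" and "again", which occur in two of the three tweets, relative to "delayed", which occurs in only one.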

Summary

Introduction

Text mining is one of the distinguished fields of data mining which possesses the potential to extract useful information from raw data. Text classification is becoming a prominent field of research in text mining, especially after the inception and spread of social platforms such as Facebook, Twitter, etc. People express their views on such platforms, and their opinions serve as a guideline for designing and governing the policies of various companies. The wide use of such social platforms generates large amounts of data that contain a variety of potentially useful information.

Methods

Results

Conclusion