A novel unsupervised ensemble framework using concept-based linguistic methods and machine learning for Twitter sentiment analysis

Maryum Bibi, Celestine Iwendi, Thippa Reddy Gadekallu, Sundus Khalil, Wajid Aziz, Mueen Uddin, Wajid Arshad Abbasi

doi:10.1016/j.patrec.2022.04.004

Abstract

• A novel ensemble (cooperative) framework based on concept-based methods and clustering is proposed for Twitter sentiment analysis.
• It employs majority voting, tie-breaker criteria, and linguistic rules in the concept-based module.
• A comparative analysis of clustering and classification, each integrated with concept-based methods, is presented.
• The performance of two feature representation methods (Boolean and TF-IDF) is reported.
• Experimental results on Twitter datasets show better performance for the proposed framework.

Concept-based sentiment analysis (CBSA) methods have gained prominence in natural language processing in recent years. These methods consider the underlying semantic meaning of text to perform tasks such as Twitter sentiment analysis (assigning a positive, negative, or neutral sentiment to Tweets). CBSA is superior to traditional statistical methods at accurately discovering sentiment labels. However, because their knowledge bases are limited, CBSA methods cannot identify the sentiment polarity of every kind of text. Supervised learning techniques are therefore usually ensembled with CBSA methods to classify the whole text. These techniques require labeled data, and manually labeling large datasets (such as Twitter datasets) is tedious and time-consuming. An unsupervised learning mechanism can therefore be a better alternative. In this paper, a novel unsupervised learning framework based on concept-based methods and hierarchical clustering is proposed for Twitter sentiment analysis. Popular hierarchical clustering methods, including the single linkage, complete linkage, and average linkage algorithms, are ensembled serially. Two feature representation methods, Boolean and term frequency-inverse document frequency (TF-IDF), are investigated. We have also experimented with well-known classifiers (Naive Bayes and a neural network) for a fair comparison. Accuracy (the proportion of correct predictions) is used to evaluate the performance of the techniques under study. It is empirically shown that the performance of the unsupervised learning techniques is comparable with that of the supervised learning techniques.
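
As a rough illustration of the ingredients named in the abstract, the Python sketch below builds Boolean and TF-IDF features for a few tokenized toy tweets, clusters them with single, complete, and average linkage, and combines the three results by majority vote before computing accuracy. It is not the authors' implementation. The toy data, the TF-IDF variant (length-normalized term frequency times log(N/df)), the Euclidean distance, the first-appearance label alignment, the parallel vote (the abstract says the linkage methods are ensembled serially but gives no details), and every function name are assumptions made for illustration. The concept-based module is omitted.

```python
import math
from collections import Counter

def boolean_features(docs, vocab):
    # 1.0 if the term occurs in the tweet, else 0.0
    return [[1.0 if t in d else 0.0 for t in vocab] for d in docs]

def tfidf_features(docs, vocab):
    # Assumed variant: (count / tweet length) * log(N / document frequency)
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    rows = []
    for d in docs:
        counts = Counter(d)
        rows.append([counts[t] / len(d) * math.log(n / df[t]) for t in vocab])
    return rows

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k, linkage):
    """Merge the two closest clusters until k remain; return one label per point."""
    clusters = [[i] for i in range(len(points))]

    def dist(c1, c2):
        d = [euclidean(points[i], points[j]) for i in c1 for j in c2]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        return sum(d) / len(d)  # average linkage

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

def canonical(labels):
    # Cluster ids are arbitrary; renumber them by order of first appearance
    mapping = {}
    return [mapping.setdefault(l, len(mapping)) for l in labels]

def majority_vote(label_sets):
    # Placeholder tie-break: Counter keeps the earliest-seen label on ties
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*label_sets)]

def ensemble(points, k=2):
    runs = [canonical(agglomerative(points, k, m))
            for m in ("single", "complete", "average")]
    return majority_vote(runs)

def accuracy(pred, gold):
    # Proportion of correct predictions
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

# Hypothetical toy tweets, already tokenized; gold: 0 = positive, 1 = negative
docs = [
    ["good", "great", "love"],
    ["great", "love", "happy"],
    ["good", "happy", "love"],
    ["bad", "awful", "hate"],
    ["awful", "hate", "sad"],
    ["bad", "sad", "hate"],
]
gold = [0, 0, 0, 1, 1, 1]
vocab = sorted({t for d in docs for t in d})

for name, featurize in (("Boolean", boolean_features), ("TF-IDF", tfidf_features)):
    pred = ensemble(featurize(docs, vocab))
    print(name, pred, accuracy(pred, gold))
```

On real tweets the cluster ids would still have to be mapped to sentiment labels, for example by using the tweets that the concept-based module can already label. Here the first-appearance numbering happens to match the gold labels only because the toy data is ordered that way.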

Full Text