Abstract

Sentiment analysis is a prominent research area in data mining and knowledge discovery, and has proven to be an effective technique for monitoring public opinion. The big data era, with a high volume of data generated by a variety of sources, has provided enhanced opportunities for utilizing sentiment analysis in various domains. To take full advantage of this volume of data for accurate sentiment analysis, it is essential to clean the data before the analysis, as irrelevant or redundant data hinder the extraction of valuable information. In this paper, we propose a hybrid feature selection algorithm to improve the performance of sentiment analysis tasks. Our proposed sentiment analysis approach builds a binary classification model based on two feature selection techniques: an entropy-based metric and an evolutionary algorithm. We have performed comprehensive experiments in two different domains using a benchmark dataset, Stanford Sentiment Treebank, and a real-world dataset we have created based on World Health Organization (WHO) public speeches regarding COVID-19. The proposed feature selection model is shown to achieve significant performance improvements on both datasets, increasing classification accuracy for all utilized combinations of machine learning and text representation techniques. Moreover, it achieves over 70% reduction in feature size, which saves computation time and space.

Highlights

  • The significant advances in data storage, communication and processing technologies in recent years have given rise to the big data era, with a plethora of information flowing in from various data sources at high speeds

  • One of the main challenges in sentiment classification is the large volume of data containing irrelevant or redundant features [27], which adversely affect the performance of machine learning models [28]

  • In this paper, we propose a hybrid multiobjective feature selection algorithm to improve the performance of the sentiment classification task in various domains


Summary

INTRODUCTION

The significant advances in data storage, communication and processing technologies in recent years have given rise to the big data era, with a plethora of information flowing in from various data sources at high speeds. One of the main challenges in sentiment classification is the large volume of data containing irrelevant or redundant features [27], which adversely affect the performance of machine learning models [28]. Although there exist feature selection methods that combine filter- and wrapper-based approaches for sentiment analysis [36], [37], all of them approach the problem from a single-objective perspective. We propose a new hybrid multiobjective feature selection model for the sentiment analysis task, which harnesses the power of an entropy-based metric, i.e., Information Gain, and an evolutionary algorithm, i.e., Nondominated Sorting Genetic Algorithm II (NSGA-II). Experiments with different machine learning and feature extraction techniques on the well-known Stanford Sentiment Treebank dataset demonstrate that our proposed model considerably improves the learning performance of the sentiment analysis task.
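The two ingredients named above can be illustrated with a small sketch. It is not the paper's implementation: the toy corpus, the function names, and the candidate scores are hypothetical, and the choice of objectives (maximize accuracy, minimize the number of features) is an assumption that this summary does not state explicitly. The first part scores terms by Information Gain, as a filter step might; the second keeps only nondominated candidates, which is the dominance test at the core of NSGA-II (the full algorithm also ranks successive fronts and uses crowding distance, both omitted here).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without_term) / n) * entropy(without_term))
    return entropy(labels) - conditional


def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is (accuracy, n_features). Assumed objectives:
    higher accuracy is better, fewer features is better.
    """
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return no_worse and strictly_better

    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical toy corpus: each document is a set of tokens, label 1 = positive.
docs = [{"great", "movie"}, {"great", "plot"},
        {"boring", "movie"}, {"boring", "plot"}]
labels = [1, 1, 0, 0]

# Filter step: rank the vocabulary by Information Gain.
vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                reverse=True)

# Hypothetical (accuracy, n_features) scores for four feature subsets.
front = pareto_front([(0.80, 100), (0.82, 40), (0.78, 30), (0.79, 60)])
```

In this toy corpus, "great" and "boring" separate the classes perfectly and receive the maximum gain of 1 bit, while "movie" and "plot" appear equally in both classes and receive 0. In a hybrid scheme of this kind, a filter score such as Information Gain would typically prune the vocabulary before the evolutionary search explores subsets of the surviving features.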

RELATED WORK
PROPOSED MODEL
EXPERIMENT RESULTS
Findings
CONCLUSION
