The primary goal of this paper is to develop optimized classical machine learning models for sentiment analysis of social media posts. The methodology includes the application of k-fold cross-validation and feature selection using the χ² independence test to prevent overfitting and improve classifier accuracy. Classical algorithms such as naïve Bayes, support vector machines, decision trees, and k-nearest neighbors are applied, along with preprocessing techniques, across five distinct datasets for model construction and evaluation. These optimized classical models are compared with recurrent neural network architectures, including LSTM and GRU, to evaluate the relative efficiency of both approaches. In total, 34 classification models were generated, with hyperparameter optimization for classical methods performed by grid search. The highest accuracy achieved was 82.45% for data without preprocessing and 78.83% for fully preprocessed data, both with the naïve Bayes algorithm. After hyperparameter optimization, some models achieved an accuracy greater than 90%. An analysis of variance indicated statistically significant differences among the models, confirming that feature selection and hyperparameter optimization are important factors in classifier performance. The proposed approach proved efficient for analyzing unstructured textual data, allowing the development of optimized models even with limited labeled data, and providing valuable insights into user opinions across different contexts.
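As an illustration of the χ² feature-selection step mentioned above, the following minimal sketch scores each term by the textbook Pearson χ² statistic of independence between term presence and class label, then keeps the highest-scoring terms. It is not the authors' implementation: the function names `chi2_term_scores` and `select_top_k`, the whitespace tokenizer, and the four-document toy corpus are all assumptions made for the example, and the abstract does not say which library or χ² variant was used.

```python
from collections import Counter


def chi2_term_scores(docs, labels):
    """Return {term: chi-squared statistic} for term presence vs. class label.

    Hypothetical helper for illustration; uses naive whitespace tokenization.
    """
    n = len(docs)
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    # Sets, because the test is on presence/absence, not term frequency.
    doc_terms = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_terms)

    scores = {}
    for term in vocab:
        # Number of documents per class that contain the term.
        present = Counter(
            lab for terms, lab in zip(doc_terms, labels) if term in terms
        )
        n_present = sum(present.values())
        n_absent = n - n_present
        stat = 0.0
        for c in classes:
            # Two cells per class: term present, term absent.
            cells = (
                (present[c], n_present),
                (class_counts[c] - present[c], n_absent),
            )
            for observed, row_total in cells:
                expected = row_total * class_counts[c] / n
                if expected > 0:  # skip empty rows (term in every document)
                    stat += (observed - expected) ** 2 / expected
        scores[term] = stat
    return scores


def select_top_k(scores, k):
    """Return the k terms with the largest statistic (ties broken by name)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus, invented for this example only.
docs = [
    "great phone love it",
    "love this great camera",
    "terrible battery hate it",
    "hate this terrible screen",
]
labels = ["pos", "pos", "neg", "neg"]

scores = chi2_term_scores(docs, labels)
top_terms = select_top_k(scores, 4)
```

In this toy corpus, terms that occur in only one class ("great", "love", "hate", "terrible") receive the largest statistic, while terms spread evenly across classes ("it", "this") score zero and would be dropped. In a full pipeline of the kind the abstract describes, the selection would be fitted inside each cross-validation fold rather than on the whole dataset, so that the held-out fold does not influence which features are kept.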