Feature extraction plays a critical role in text classification, as it converts textual data into numerical representations suitable for machine learning models. A key challenge lies in effectively capturing both semantic and contextual information from text at various levels of granularity while avoiding overfitting. Prior methods have often shown suboptimal performance, largely due to limitations of the feature extraction techniques they employ. To address these challenges, this study introduces Multi-TextCNN, an advanced feature extractor designed to capture essential textual information across multiple levels of granularity. Multi-TextCNN is integrated into a proposed classification model, MuTCELM, which aims to enhance text classification performance. MuTCELM leverages five distinct sub-classifiers, each designed to capture different linguistic features from the text, and combines them in an ensemble framework that improves overall performance by exploiting their complementary strengths. Empirical results show that MuTCELM achieves average improvements across all datasets of 0.2584, 0.2546, 0.2668, and 0.2612 in accuracy, precision, recall, and F1-macro, respectively, demonstrating substantial gains over baseline models. These findings underscore the effectiveness of Multi-TextCNN in improving model performance relative to other ensemble methods. Further analysis shows non-overlapping confidence intervals between MuTCELM and the baseline models, indicating that the observed improvements are statistically significant rather than attributable to random chance. This evidence points to the robustness and superiority of MuTCELM across various languages and text classification tasks.
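The abstract does not specify the exact Multi-TextCNN architecture, so the following is a minimal sketch, assuming a standard multi-kernel TextCNN in PyTorch: parallel 1-D convolutions with different kernel sizes extract features at different granularities (small kernels for word-level patterns, larger ones for phrase-level context), which are then pooled and concatenated. The class name, kernel sizes, and filter counts are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiGranularityTextCNN(nn.Module):
    """Illustrative multi-kernel TextCNN feature extractor.

    Hypothetical sketch: the paper's actual Multi-TextCNN configuration
    (kernel sizes, channel counts, pooling) is not given in the abstract.
    """

    def __init__(self, vocab_size, embed_dim=128, num_filters=64,
                 kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size: each kernel width
        # corresponds to a different level of textual granularity.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> embed and move channels first
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        # Max-pool each granularity over the time axis, then concatenate.
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)  # (batch, num_filters * len(kernel_sizes))


if __name__ == "__main__":
    extractor = MultiGranularityTextCNN(vocab_size=10_000)
    dummy = torch.randint(0, 10_000, (8, 50))  # batch of 8 token sequences
    print(extractor(dummy).shape)  # torch.Size([8, 256])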
```

In MuTCELM, features of this kind would feed five sub-classifiers whose outputs are combined in an ensemble; the abstract does not state the combination rule, but a common choice is soft voting over the sub-classifiers' class probabilities.