Text categorization assigns predefined category labels to unlabeled documents. With the exponential growth in the availability and accessibility of digital documents over the past decade, the field has attracted considerable attention from the scientific community, which needs rapid and accurate categorization of these documents. Relying on experts for manual classification is time-consuming and resource-intensive, so labeling unlabeled digital documents faster, more accurately, and more efficiently has become essential. One promising approach is machine learning: algorithms trained on a large dataset of labeled texts learn patterns that let them predict the categories of unlabeled documents. This strategy can greatly expedite categorization while retaining a substantial level of accuracy. These algorithms have also improved natural language processing techniques, making them more accurate at classifying unlabeled digital documents. In this study, we propose a novel machine learning framework to address this challenge. Our framework incorporates a novel Bangla stemmer, which reduces words to their stems. We then apply TF-IDF, a statistical measure of how relevant a word is to a document, to vectorize the documents for categorization. Experimental results show that our framework significantly enhances prediction performance, achieving 95.3% prediction accuracy.
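The TF-IDF vectorization step named above can be illustrated with a minimal sketch. It assumes the common tf × log(N/df) weighting; the abstract does not say which TF-IDF variant or library the framework uses. The function name `tfidf_vectors` is hypothetical, and the toy English tokens only stand in for the output of the Bangla stemmer, which is not reproduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents.

    Assumed weighting: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append([
            (counts[t] / total) * math.log(n_docs / df[t]) if total else 0.0
            for t in vocab
        ])
    return vocab, vectors


# Toy tokens (hypothetical); in the framework these would be stemmed Bangla words.
docs = [
    ["cricket", "match", "score"],
    ["election", "vote", "score"],
    ["cricket", "team", "match"],
]
vocab, vectors = tfidf_vectors(docs)
```

Terms concentrated in a few documents, such as "election", receive higher weights than terms spread across many, and a term that appears in every document receives weight zero under this variant. The resulting vectors are what a downstream classifier would consume.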