Abstract

Text classification is one of the most challenging problems in natural language processing (NLP), and language models are at its heart. The ability to represent texts as numbers has given rise to many NLP tasks, for example, text categorisation, translation, and summarisation. Unfortunately, NLP for Bengali texts has not yet reached the state-of-the-art level of other languages such as English, mostly due to the scarcity of resources and the complexity of Bengali grammar. Consequently, relatively little work has been done in this field. In this paper, we study the Word2vec word embedding method, based on the continuous bag of words (CBOW) architecture, in combination with several ensemble machine learning algorithms: the adaptive boosting classifier (AdaBoost), the light gradient boosting machine (LightGBM), XGBoost, and the random forest classifier (RFC). The model is trained on a large corpus of Bengali newspaper text comprising 99,283,949 words and 8,284,804 sentences across 392,772 documents. In our experiments, the Word2vec CBOW model combined with XGBoost performed considerably better than the other models, achieving 92.24% accuracy.
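For context, a minimal sketch of the pipeline the abstract describes might look as follows, using gensim's Word2Vec (with `sg=0` for CBOW) and XGBoost. The abstract does not specify how word vectors are aggregated into document features, so the averaging step, the toy corpus, and all hyperparameters below are illustrative assumptions rather than the paper's actual settings.

```python
# Sketch: Word2vec (CBOW) document vectors fed into an XGBoost classifier.
# Toy data and hyperparameters are assumptions, not the paper's configuration.
import numpy as np
from gensim.models import Word2Vec
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Toy stand-in for the tokenised Bengali news corpus: one token list per
# document, with an integer category label (0 = sports-like, 1 = politics-like).
docs = [
    ["দল", "ম্যাচ", "গোল", "জয়"],
    ["খেলা", "দল", "জয়", "গোল"],
    ["সরকার", "নির্বাচন", "মন্ত্রী", "আইন"],
    ["নির্বাচন", "সরকার", "আইন", "মন্ত্রী"],
] * 25
labels = [0, 0, 1, 1] * 25

# sg=0 selects the CBOW architecture (gensim's default).
w2v = Word2Vec(sentences=docs, vector_size=100, window=5,
               min_count=1, sg=0, epochs=20, seed=42)

def doc_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens; zeros if none are found."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

X = np.vstack([doc_vector(d, w2v) for d in docs])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42, stratify=y)
clf = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

Swapping `XGBClassifier` for scikit-learn's `RandomForestClassifier`, `AdaBoostClassifier`, or LightGBM's `LGBMClassifier` reproduces the other ensemble baselines the abstract compares against.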
