A Study on Machine Learning and Deep Learning Methods Using Feature Extraction for Bengali News Document Classification

Nusrat Humaira, Summit Haque, Humaira Afia

doi:10.1109/asiancon51346.2021.9544761

Abstract

News is newly received, noteworthy information about current events. Events of all kinds happen constantly around the world, and mass media carry them to the general public. As modern life moves toward more convenient environments, Bengali mass media are also shifting to digital platforms. In this article, several supervised machine learning and deep learning approaches are proposed for classifying Bengali news documents. We use an open dataset containing more than three hundred thousand (376,211) Bengali text documents. Preprocessing includes removing stop words, dropping duplicate records, tokenizing, and stemming. For feature extraction, we use Bag-of-Words with TF-IDF and several word embedding approaches: average Word2Vec, GloVe, and fastText. We train supervised machine learning and deep learning models on this text corpus. Among these models, Support Vector Machine with average Word2Vec achieves 97% accuracy and Bidirectional LSTM achieves 96% accuracy.
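
The averaged word embedding feature mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_word_vector`, the toy three-dimensional vectors, and the choice to skip out-of-vocabulary tokens are all assumptions made for illustration. In practice the vectors would come from a trained Word2Vec model, and the resulting document vector would be passed to a classifier such as a Support Vector Machine.

```python
from typing import Dict, List, Sequence


def average_word_vector(
    tokens: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Return the mean of the embedding vectors of the known tokens.

    Tokens without an embedding are skipped (an assumption of this sketch).
    If no token is known, an all-zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # out-of-vocabulary token: ignore it
        for i, value in enumerate(vector):
            total[i] += value
        known += 1
    if known == 0:
        return total
    return [value / known for value in total]


# Toy embeddings invented for illustration; real ones would be learned
# from the news corpus and have far more dimensions.
toy_embeddings = {
    "খেলা": [1.0, 0.0, 2.0],
    "দল": [3.0, 2.0, 0.0],
}

# The third token has no embedding and is therefore skipped.
doc_vector = average_word_vector(["খেলা", "দল", "অজানা"], toy_embeddings, dim=3)
```

Each document becomes a fixed-length vector regardless of how many tokens it contains, which is what allows a conventional classifier to consume it.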

Full Text