Classifying Hindi News Using Various Machine Learning and Deep Learning Techniques

Rachna Jain, Andreas Kanavos, Monika Arora, Anusha Chhabra, Vassilis C Gerogiannis, Biswaranjan Acharya, Arpit Sharma, Dimitrios Tzimos, Harsh Singh, Saurabh Verma

doi:10.1142/s0218213023500641

Abstract

Text classification organizes textual information into predefined classes, a task that is particularly useful in domains such as sentiment analysis, spam detection, and content labeling. In India, where a massive amount of information is generated daily through newspapers and social media, Hindi is one of the most widely spoken languages. However, research on Hindi text classification, and on Hindi news classification in particular, is limited. This paper presents a study that classifies news articles published in Hindi-language newspapers in India by applying and comparing various Machine Learning (ML) and Deep Learning (DL) algorithms. To prepare the textual news data for classification, pre-processing and feature engineering techniques, such as the count vectorizer, the Tf-Idf vectorizer and Doc2Vec, were applied to convert texts into vectors. This pre-processing step was very challenging due to the presence of multimodal words, conjunctions, punctuation, and special characters in Hindi texts. The study considered Hindi news headlines from predetermined categories (Science, Sports, Entertainment and Business). Among the different ML and DL models tested and evaluated, Linear Regression with the Doc2Vec vectorizer and the SGD classifier with the Tf-Idf vectorizer produced the best accuracies of 97.04% and 96.59%, respectively. The best-performing DL model was the Bi-LSTM, with an accuracy of approximately 97% on the testing data.
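
To make the "convert texts into vectors" step concrete, the following is a minimal sketch of Tf-Idf vectorization over a few Hindi headlines, written with the Python standard library only. It is not the authors' implementation: the toy headlines, the function names `tokenize` and `tfidf_vectors`, the punctuation set, and the choice of a smoothed IDF with L2 normalization are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy headlines (assumed for illustration, not from the paper's data).
HEADLINES = [
    "भारत ने क्रिकेट मैच जीता",      # Sports
    "शेयर बाजार में तेजी",            # Business
    "भारत में नई फिल्म रिलीज",       # Entertainment
]


def tokenize(text):
    """Replace common punctuation (including the Hindi danda) with spaces, then split."""
    cleaned = re.sub(r"[।,.!?;:()\-]", " ", text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized Tf-Idf vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed IDF (one common variant; an assumption, not taken from the paper).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


vocab, vectors = tfidf_vectors(HEADLINES)
```

Each headline becomes a fixed-length vector over the shared vocabulary, with tokens that occur in fewer headlines weighted more heavily. Vectors of this kind are what a linear classifier, such as the SGD classifier mentioned above, would take as input.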

Full Text