Multi-Class Text Classification of Uzbek News Articles using Machine Learning

I M Rabbimov, S S Kobilov

doi:10.1088/1742-6596/1546/1/012097

Abstract

A large volume of online news on a wide range of topics is published on the Internet. One task in processing this data is to give users methods and tools for finding important and interesting news quickly and easily. One approach to this problem is to sort news into appropriate classes, which makes the automated classification of electronic documents increasingly important. In this paper, we consider the task of multi-class text classification for texts written in Uzbek. Articles in ten categories were selected from the Uzbek online news outlet “Daryo”, and a dataset was built from them. Six different machine learning algorithms were applied to multi-class classification of this dataset, including Support Vector Machines (SVM), Decision Tree Classifier (DTC), Random Forest (RF), Logistic Regression (LR) and Multinomial Naïve Bayes (MNB). The stages of the proposed functional scheme for text classification and the software developed are described in detail. TF-IDF weighting with word-level and character-level n-gram models was used for feature extraction. Hyperparameters were selected using 5-fold cross-validation. In the experiments, the highest accuracy achieved was 86.88%. The models and methods proposed in this paper can be applied to the classification of texts written in Uzbek and to further research in this area.
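
The feature extraction and validation steps named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' software: the function names (`word_ngrams`, `char_ngrams`, `tfidf`, `k_fold_indices`), the whitespace tokenizer, the unsmoothed TF-IDF formula, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Word-level n-grams from a naive lowercase whitespace tokenization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character-level n-grams over the lowercased text (spaces included)."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def tfidf(docs_terms):
    """Plain TF-IDF: (count / doc length) * log(N / document frequency).

    This is the textbook, unsmoothed variant; libraries often use
    smoothed or normalized versions instead.
    """
    n_docs = len(docs_terms)
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # count each term once per document
    vectors = []
    for terms in docs_terms:
        counts = Counter(terms)
        total = len(terms)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy documents (hypothetical, not taken from the paper's dataset).
docs = ["sport futbol sport", "iqtisod bozor", "sport bozor"]
word_features = tfidf([word_ngrams(d, 1) for d in docs])
char_features = tfidf([char_ngrams(d, 3) for d in docs])
folds = list(k_fold_indices(10, k=5))
```

In a hyperparameter search, each candidate setting would be trained on the `train` indices of every fold and scored on the matching `test` indices, and the setting with the best mean score would be kept. Character-level n-grams are a common choice for morphologically rich languages because they capture shared stems and suffixes that whole-word features miss.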

Full Text