Arabic Text Classification: A Comparative Approach Using a Big Dataset

Mokhtar Ali Hasan Madhfar, Mohammed Abdullah Hassan Al-Hagery

doi:10.1109/iccisci.2019.8716479

Abstract

Text classification is the process of assigning text documents to a predefined set of categories. This paper provides an experimental evaluation of six well-known classification models on a large Arabic corpus. The models are Naive Bayes (NB), Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree (DT), and Stochastic Gradient Descent (SGD). We used a corpus of 111,728 Arabic documents that fall into five categories: sport, culture, economy, news, and diverse. Three performance metrics were applied to evaluate the experimental results of each model. The results show that the Logistic Regression model achieves the highest weighted F1 score, followed by SGD, SVM, NB, Random Forest, and DT. The experiments also show that both feature size and corpus size strongly affect the performance of the models.
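
The models above are ranked by weighted F1 score. As a minimal sketch of how that metric is commonly computed (this is not the authors' code), the function below takes the per-class F1 scores and averages them, weighting each class by its number of true instances. The function name `weighted_f1` and the toy labels are assumptions chosen for illustration only and are not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # Precision is defined as 0 when the class was never predicted.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class F1 by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Hypothetical toy labels, using category names from the abstract.
y_true = ["sport", "sport", "news", "economy"]
y_pred = ["sport", "news", "news", "economy"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class is weighted by its support, frequent categories influence the score more than rare ones, which matters when the categories of a corpus are unevenly sized.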

Full Text