Abstract

Text classification is the process of assigning text documents to a predefined set of categories. This paper presents an experimental evaluation of six well-known classification models on a large Arabic corpus. The models are Naive Bayes (NB), Support Vector Machine (SVM), Random Forest, Logistic Regression, Decision Tree (DT), and Stochastic Gradient Descent (SGD). We used a corpus of 111,728 Arabic documents that fall into five categories: sport, culture, economy, news, and diverse. Three performance metrics were applied to evaluate the experimental results of each model. The results show that the Logistic Regression model achieves the highest weighted F1 score, followed by SGD, SVM, NB, Random Forest, and DT. The experiments also show that both feature size and corpus size strongly affect the performance of the models.
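For readers unfamiliar with the weighted F1 score used to rank the models, the following is a minimal sketch of how such a score is commonly computed: the per-class F1 values are averaged, each weighted by the number of true documents in that class. The abstract does not state the exact formula the authors used, so this definition is an assumption, and the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores.

    Assumption: this is the common definition of weighted F1; the
    abstract does not specify the variant used in the experiments.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    support = Counter(y_true)      # true documents per class
    predicted = Counter(y_pred)    # predicted documents per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # Precision is 0 when the class was never predicted.
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Toy labels drawn from the corpus's category names (illustrative only).
toy_true = ["sport", "sport", "news", "economy"]
toy_pred = ["sport", "news", "news", "economy"]
toy_score = weighted_f1(toy_true, toy_pred)
```

Because each class is weighted by its share of the true labels, larger categories influence the score more than smaller ones, which matters when the five categories are not equally represented.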
