Arabic text classification using master-slaves technique

Zinah Abdulridha Abutiheen, Kadhim B S Aljanabi, Ahmed H Aliwy

doi:10.1088/1742-6596/1032/1/012052

Open Access

Abstract

Text classification (TC) is an essential task in both text mining (TM) and natural language processing (NLP). People tend to organize and categorize information to make it easier to understand, and text classification is an important step toward that goal. Arabic text classification (ATC) is difficult because of the complexity of Arabic morphology. This paper proposes an approach called the Master-Slaves technique (MST) to improve Arabic text classification. The work consists of two main phases. In the first phase, a new Arabic corpus of 16,757 text files was collected and manually classified into five categories. In the second phase, four classifiers were implemented on the collected corpus: Naïve Bayes (NB), K-Nearest Neighbour (KNN), Multinomial Logistic Regression (MLR) and Maximum Weight (MW). The Naïve Bayes classifier acts as the Master and the other three as Slaves; the results of the slave classifiers are used to adjust the probabilities of the Naïve Bayes classifier. To evaluate the effectiveness and efficiency of the proposed technique, the four classifiers were also run individually on the same corpus, and a simple voting technique was applied among them. All tests were carried out after pre-processing the Arabic text documents (tokenization, stemming, and stop-word removal), with each document represented as a vector of weights. For reliability, 10-fold cross-validation was used. The results showed that the Master-Slaves technique improves document classification accuracy, with acceptable algorithmic complexity, compared to the other techniques.
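
The idea of slave classifiers adjusting the master's probabilities can be illustrated with a minimal Python sketch. The abstract does not state the actual update rule, so the additive `boost` per slave vote used here is purely an assumption made for illustration, as are the function names `master_slaves_predict` and `simple_vote` and the category labels. The master's class probabilities are assumed to be already computed by a Naïve Bayes model, and each slave is assumed to contribute a single predicted label.

```python
from collections import Counter
from typing import Dict, List, Tuple


def master_slaves_predict(
    master_probs: Dict[str, float],
    slave_votes: List[str],
    boost: float = 0.1,
) -> Tuple[str, Dict[str, float]]:
    """Adjust the master's class probabilities with the slaves' votes.

    ASSUMPTION: the update rule (add `boost` per slave vote, then
    renormalize) is hypothetical; the abstract does not specify it.
    """
    adjusted = dict(master_probs)
    for label in slave_votes:
        # Each slave vote raises the score of the class it predicted.
        adjusted[label] = adjusted.get(label, 0.0) + boost

    # Renormalize so the adjusted scores again sum to 1.
    total = sum(adjusted.values())
    normalized = {label: score / total for label, score in adjusted.items()}

    # The final decision is the class with the highest adjusted score.
    best = max(normalized, key=normalized.get)
    return best, normalized


def simple_vote(predictions: List[str]) -> str:
    """Baseline for comparison: plain majority voting over all classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical master (Naive Bayes) output for one document.
nb_probs = {"sports": 0.40, "politics": 0.35, "economy": 0.25}
# Hypothetical labels predicted by the three slaves (KNN, MLR, MW).
votes = ["politics", "politics", "sports"]

label, probs = master_slaves_predict(nb_probs, votes, boost=0.1)
print(label)  # politics
```

In this toy case the master alone would choose "sports", but two of the three slaves agree on "politics", which is enough to shift the decision. Unlike simple voting, the master's confidence still matters: with a smaller `boost`, or a larger gap between the master's top two classes, the slaves would not overturn its choice.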
