Efficient Feature Representation Based on the Effect of Words Frequency for Arabic Documents Classification

Yousif A Alhaj, Mohammed A A Al-Qaness, Aamir Hussain, Wiraj Udara Wickramaarachchi, Hammam M Abdelaal

doi:10.1145/3291842.3291900

Abstract

This paper studies the effect of word frequency on the classification of Arabic documents and on two feature representations, namely Bag of Words (BoW) and Term Frequency-Inverse Document Frequency (TF-IDF). Three classification techniques are evaluated: Naive Bayes (NB), k-Nearest Neighbor (KNN), and Support Vector Machine (SVM). Chi-square is used as a feature selection method to keep essential features and remove unnecessary ones. Experiments are conducted on a publicly available dataset collected from Arabic websites, the CNN Arabic Corpus, to study classification performance. K-fold cross-validation is used to validate the classifiers, and Micro-F1 is used to evaluate them. The results show that the SVM classifier outperformed the KNN and NB classifiers with the TF-IDF representation, and that the NB classifier outperformed the KNN and SVM classifiers with the BoW representation. The SVM and NB classifiers attained Micro-F1 values of 94.38% and 93.47%, respectively, when words were eliminated.
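To make the two representations and the evaluation metric named in the abstract concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the authors' implementation: the `min_count` cutoff, the `log(N / df)` form of IDF, the function names, and the toy tokens (English placeholders standing in for Arabic words) are all hypothetical choices made for this example. Chi-square selection, the three classifiers, and K-fold validation are not reproduced here.

```python
import math
from collections import Counter

def build_vocab(docs, min_count=1):
    """Keep words whose total corpus frequency is at least min_count (assumed cutoff rule)."""
    totals = Counter(w for doc in docs for w in doc)
    return sorted(w for w, c in totals.items() if c >= min_count)

def bow_vector(doc, vocab):
    """Bag of Words: raw count of each vocabulary word in one document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vectors(docs, vocab):
    """TF-IDF with tf = raw count and idf = log(N / df), one common variant."""
    n = len(docs)
    df = {w: sum(1 for doc in docs if w in doc) for w in vocab}
    idf = {w: math.log(n / df[w]) for w in vocab}
    return [
        [count * idf[w] for count, w in zip(bow_vector(doc, vocab), vocab)]
        for doc in docs
    ]

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes, then compute F1."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy, pre-tokenized documents (placeholder tokens, not taken from the CNN Arabic Corpus).
docs = [
    ["market", "bank", "market"],
    ["match", "goal", "team"],
    ["bank", "loan", "market"],
    ["team", "goal", "goal"],
]

# Words occurring fewer than 2 times in the corpus ("match", "loan") are dropped.
vocab = build_vocab(docs, min_count=2)
bow = [bow_vector(doc, vocab) for doc in docs]
tfidf = tfidf_vectors(docs, vocab)

# Hypothetical true and predicted labels for the four documents.
score = micro_f1(["biz", "sport", "biz", "sport"], ["biz", "sport", "sport", "sport"])
```

Raising `min_count` shrinks the vocabulary, which shortens both the BoW and TF-IDF vectors. This is the kind of frequency-based word elimination whose effect on classifier accuracy the abstract describes.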

Full Text