Abstract

The feature type (FT) chosen for extraction from the text and presented to the classification algorithm (CAL) is one of the factors affecting text classification (TC) accuracy. Character N-grams, word roots, word stems, and single words have been used as features for Arabic TC (ATC). A survey of the current literature shows that no prior studies have examined the effect of using word N-grams (sequences of N consecutive words) on ATC accuracy. Consequently, we conducted 576 experiments using four FTs (single words, 2-grams, 3-grams, and 4-grams), four feature selection methods (document frequency (DF), chi-squared, information gain, and Galavotti-Sebastiani-Simi) with four feature-count thresholds (50, 100, 150, and 200), three data representation schemes (Boolean, term frequency-inverse document frequency, and lookup table convolution), and three CALs (naive Bayes (NB), k-nearest neighbor (KNN), and support vector machine (SVM)). Our results show that using single words as features yields greater classification accuracy (CA) for ATC than word N-grams. Moreover, CA decreases by 17% on average as the N-gram size (N) increases. The data also show that the SVM CAL provides greater CA than NB and KNN; however, the best CA for 2-grams, 3-grams, and 4-grams is achieved when the NB CAL is used with Boolean representation and 200 features.

Keywords: Arabic text classification; feature extraction; classification algorithms; classification accuracy
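
For illustration only (this is not the authors' implementation), the following Python sketch, assuming scikit-learn, shows how one configuration from the factor space described above could be assembled: single-word features, TF-IDF representation, chi-squared selection of the top 200 features, and an SVM classifier. The corpus variables (train_docs, train_labels, test_docs) are hypothetical placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # One configuration from the experimental grid: single words (1-grams),
    # TF-IDF weighting, chi-squared selection of 200 features, and an SVM.
    # Swapping ngram_range to (2, 2), setting binary counts, and using
    # BernoulliNB would approximate the Boolean/NB setting reported as best
    # for 2-grams.
    pipeline = Pipeline([
        ("vectorize", TfidfVectorizer(ngram_range=(1, 1))),
        ("select", SelectKBest(chi2, k=200)),
        ("classify", LinearSVC()),
    ])

    # Usage (assumes a labeled Arabic corpus is loaded elsewhere):
    # pipeline.fit(train_docs, train_labels)
    # predictions = pipeline.predict(test_docs)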
