Improved Document Categorization Through Feature-Rich Combinations

Anoual El Kah, Imad Zeroual

doi:10.1007/978-3-030-76346-6_32

Abstract

Several comparative studies report new findings relevant to the Text Categorization (TC) task, and all provide valuable observations. However, most of them address Western languages, especially English. This paper takes a step toward filling this gap by focusing on a less commonly investigated language, Arabic, to provide a more balanced perspective. To that end, it presents a deeper investigation of the performance of some well-known classification methods successfully applied to automatic TC, such as Naïve Bayes, Support Vector Machines, and Decision Trees. The investigation also covers pre-processing techniques and feature selection methods that deal with the high dimensionality of the data. Specifically, stop-word elimination, stemming, and lemmatization are the pre-processing techniques included, along with TF-IDF and Chi-square as the feature selection methods, and all possible combinations are considered. To make this study accurate and comprehensive, we trained and evaluated the selected classifiers and pre-processing techniques on common ground: an in-house, balanced, large corpus of 300,000 news articles equally distributed across six categories. The findings demonstrate the effectiveness of combining pre-processing techniques, feature selection methods, and classifiers.
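
To make "all possible combinations" concrete, the following is a minimal sketch of how such an experimental grid could be enumerated. It is an illustration only: the abstract does not state the exact grid, so the sketch assumes that any subset of the three pre-processing techniques (including none) may be paired with either feature selection method and any of the three classifiers. All identifiers are hypothetical labels, not names taken from the study.

```python
from itertools import combinations, product

# Hypothetical labels for the components named in the abstract.
PREPROCESSING = ("stop_word_removal", "stemming", "lemmatization")
FEATURE_SELECTION = ("tf_idf", "chi_square")
CLASSIFIERS = ("naive_bayes", "svm", "decision_tree")


def preprocessing_subsets(steps):
    """Return every subset of the steps, from the empty set to all of them."""
    return [
        subset
        for size in range(len(steps) + 1)
        for subset in combinations(steps, size)
    ]


def experiment_grid():
    """Pair each pre-processing subset with each feature selector and classifier.

    Assumption: every subset is allowed, even stemming together with
    lemmatization; the study itself may restrict this.
    """
    return list(
        product(preprocessing_subsets(PREPROCESSING), FEATURE_SELECTION, CLASSIFIERS)
    )


grid = experiment_grid()
for steps, selector, classifier in grid[:3]:
    print(steps or ("no_preprocessing",), selector, classifier)
```

Under these assumptions the grid holds 8 pre-processing subsets × 2 feature selection methods × 3 classifiers; a stricter design, for example one treating stemming and lemmatization as mutually exclusive, would yield fewer configurations.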

Full Text