Abstract

Several comparative studies report new findings relevant to the Text Categorization (TC) task, and all provide valuable observations. However, most of them address Western languages, especially English. This paper takes a step toward filling that gap by focusing on a less commonly investigated language, Arabic, to provide a more balanced perspective. To that end, it presents a deeper investigation of the performance of several well-known machine learning methods that have been applied successfully to automatic TC: Naïve Bayes, Support Vector Machines, and Decision Tree. The investigation also covers pre-processing techniques and feature selection methods that address the high dimensionality of the data. Specifically, the pre-processing techniques are stop-word elimination, stemming, and lemmatization, and the feature selection methods are TF-IDF and Chi-square. All possible combinations of these are considered. To make the study accurate and comprehensive, we trained and evaluated the selected classifiers and pre-processing techniques on common ground, using a large, balanced in-house corpus of 300,000 news articles distributed equally across six categories. The findings demonstrate the effectiveness of combining the pre-processing techniques, feature selection methods, and classifiers.
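To make the Chi-square feature selection step mentioned above concrete, the following is a minimal sketch of how a term can be scored against a category using a 2x2 contingency table. It is not the authors' implementation: the function name `chi_square`, the toy English tokens, and the two category labels are assumptions chosen only for illustration, and documents are assumed to be already tokenized and pre-processed.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic for one (term, category) pair.

    docs   : list of token sets, one per document (assumed pre-processed)
    labels : list of category labels, aligned with docs
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or the category never varies,
    # so the term carries no information about the category.
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


# Hypothetical toy corpus: four tokenized documents in two categories.
docs = [{"match", "goal"}, {"goal", "team"}, {"market", "bank"}, {"bank", "team"}]
labels = ["sports", "sports", "economy", "economy"]

goal_score = chi_square(docs, labels, "goal", "sports")  # 4.0
team_score = chi_square(docs, labels, "team", "sports")  # 0.0
```

In this toy corpus "goal" appears only in sports documents and receives the highest possible score for four documents, while "team" is spread evenly across both categories and scores zero. A feature selector built on this idea would keep the highest-scoring terms per category and discard the rest.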
