Abstract

This paper examines how combining words and concepts in the text representation affects the accuracy of Arabic Automatic Text Classification (ATC) when used with various stemming methods and classifiers. An experimental Arabic ATC system was developed, and the effects of its main components on classification accuracy were assessed. First, variants of the standard Bag-of-Words model with different stemming methods were examined and compared. Next, Arabic Wikipedia and WordNet were compared as sources of concepts for an effective Bag-of-Concepts representation. Based on this comparison, Wikipedia was used to provide concepts, and different strategies for combining words and concepts, including two new approaches developed in-house, were evaluated in terms of their impact on overall classification accuracy. Our experimental results show that text representation is a key element in the performance of Arabic ATC, and that combining words and concepts to represent Arabic text improves classification accuracy compared with using words or concepts alone.
