Abstract

Text classification (TC) is one of the fundamental problems in text mining. Plenty of works exist on TC with interesting approaches and excellent results; however, most of these works follow a word-based approach for feature extraction. In this work, we are interested in an alternative (byte-based or character-based) approach known as compression-based TC (CTC). CTC has been used for some languages such as English and Portuguese and it is shown to have certain advantages/ disadvantages compared with word-based approaches. This work applies CTC on the Arabic language with the purpose of investigating whether these advantages/disadvantages exists for the Arabic language as well. The results are encouraging as they show the viability of using CTC for Arabic TC.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call