Abstract

Arabic text classification methods have emerged as a natural result of the massive amount of varied textual information written in Arabic on the web. In most text classification processes, feature selection is a crucial task because it strongly affects classification accuracy. Generally, two types of features can be used: statistical features, and semantic and concept features. The main interest of this paper is to identify the most effective semantic and concept features for Arabic text classification. In this study, two novel features that use the lexical, semantic and lexico-semantic relations of the Arabic WordNet (AWN) ontology are suggested. The first feature set is the List of Pertinent Synsets (LoPS), which is a list of synsets that have a specific relation with the original terms. The second feature set is the List of Pertinent Words (LoPW), which is a list of words that have a specific relation with the original terms. Fifteen different relations (defined in the AWN ontology) are used with both proposed features. A Naïve Bayes classifier is used to perform the classification. Experiments conducted on the BBC Arabic dataset show that using the LoPS feature set improves the accuracy of Arabic text classification compared with the well-known Bag-of-Words feature and the recent Bag-of-Concept (synset) features. It was also found that LoPW (especially with the related-to relation) improves classification accuracy compared with LoPS, Bag-of-Words and Bag-of-Concept.

Highlights

  • The evolution of the Internet has led to increased availability of digital textual information and documents written in different languages

  • To improve Arabic Text Classification (TC), we suggest using the lexical, semantic and lexico-semantic relations of the Arabic WordNet (AWN) ontology

  • Two novel features based on these AWN relations are used with a Naïve Bayes classifier to classify Arabic documents

Summary

Introduction

The evolution of the Internet has led to increased availability of digital textual information and documents written in different languages. Few attempts have been made to develop an automatic text classification system (ATCS) for documents written in other languages, including Arabic. Most of these attempts are based on statistical approaches (applied to bags of words) that produce inaccurate results. Arabic WordNet (AWN) is considered one of the best semantic and lexical thesauruses for Modern Standard Arabic. It is widely used in Arabic natural language processing applications (Boudabous et al., 2013; Elberrichi and Abidi, 2012). Because the relations between words outlined by AWN provide semantic information about the concepts and their original words, they are exploited in this research to improve the Arabic text classification process. Only a limited amount of research has tried to improve Text Classification (TC) processes using AWN components, such as n-grams, synonyms and concepts (Alahmadi et al., 2014; Elberrichi and Abidi, 2012).
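To make the idea of relation-based features concrete, the following is a minimal Python sketch of how a List of Pertinent Words could be built and passed to a Naïve Bayes classifier. It is an illustration under stated assumptions, not the authors' implementation: the `TOY_RELATIONS` table is a hypothetical stand-in for AWN lookups, English tokens replace Arabic ones for readability, the `lopw` helper and the `TinyNaiveBayes` class are names invented here, and the choice to keep only the related words (dropping the original terms) is an assumption.

```python
from collections import Counter, defaultdict
import math

# Hypothetical stand-in for AWN lookups: (word, relation) -> related words.
TOY_RELATIONS = {
    ("goal", "related_to"): ["football", "match"],
    ("striker", "related_to"): ["football"],
    ("penalty", "related_to"): ["football"],
    ("election", "related_to"): ["vote", "parliament"],
    ("minister", "related_to"): ["government", "parliament"],
}

def lopw(tokens, relation, relations=TOY_RELATIONS):
    """Collect the words linked to each token by the given relation."""
    features = []
    for tok in tokens:
        features.extend(relations.get((tok, relation), []))
    return features

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for feats, label in zip(docs, labels):
            self.word_counts[label].update(feats)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, feats):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # class prior
            for w in feats:
                if w in self.vocab:  # ignore features never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training set: token lists with their category labels.
train = [(["goal", "striker"], "sport"), (["election", "minister"], "politics")]
X = [lopw(tokens, "related_to") for tokens, _ in train]
y = [label for _, label in train]
model = TinyNaiveBayes().fit(X, y)

# "penalty" never occurs in the training documents, but its related word does.
prediction = model.predict(lopw(["penalty"], "related_to"))
```

In this toy setup the test token shares no surface word with the training documents, yet it maps to a related word that the classifier has already seen, which is the kind of semantic overlap a plain bag of words cannot provide.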

Related Works
Evaluation Dataset
Conclusion and Future Work
Funding Information