Abstract

Text classification is the process of assigning textual content to a set of predefined classes or categories. As enormous numbers of documents are published on the Internet every day, text classification techniques have become essential for purposes such as improving search retrieval and recommendation systems. Much work has examined different aspects of English text classification. However, little attention has been devoted to Arabic text classification because of the difficulty of processing the Arabic language. In this paper, we therefore propose an enhanced Arabic topic-discovery architecture (EATA) that uses an ontology to provide an effective Arabic topic classification mechanism. We introduce a semantic enhancement model that improves Arabic text classification and topic discovery by exploiting the rich semantic information in an Arabic ontology. The study relies on the vector space model with term frequency-inverse document frequency (TF-IDF) weighting, together with cosine similarity, to classify new Arabic textual documents.

Highlights

  • Text classification has become a vital technique for assigning unclassified content to predefined classes

  • We introduce a semantic model that uses a vector space model with term frequency-inverse document frequency (TF-IDF) weighting and cosine similarity to improve Arabic classification and topic discovery

  • We examine the impact of the proposed semantic clustering mechanism (SCM) on Arabic classification performance and compare the SCM with three baseline methods: support vector machine (SVM), naive Bayes (NB), and decision tree (DT)

Summary

Introduction

Text classification has become a vital technique for assigning unclassified content to predefined classes. It helps readers find relevant information and can support decision making. We use the Arabic ontology introduced by Hawalah [20], who proposed an Arabic multi-disciplinary ontology built from multiple resources. This ontology consists of a number of topics, each linked to sub-topics. We introduce a semantic model that uses a vector space model with term frequency-inverse document frequency (TF-IDF) weighting and cosine similarity to improve Arabic classification and topic discovery.
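To make the TF-IDF and cosine similarity step concrete, here is a minimal sketch in plain Python. It is not the EATA implementation: the two toy topic descriptions stand in for ontology topics, the smoothed IDF formula and whitespace tokenization are assumptions, and all function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `classify`) are hypothetical. It assigns a new document to the topic whose TF-IDF vector is closest by cosine similarity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization only. Real Arabic text would
    # need normalization, stop-word removal, and stemming.
    return text.split()


def build_idf(token_lists):
    # Document frequency: in how many documents each term appears.
    n = len(token_lists)
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    # Smoothed IDF (an assumption; the paper's exact weighting is not stated here).
    return {term: math.log((1 + n) / (1 + count)) + 1.0
            for term, count in df.items()}


def tfidf_vector(tokens, idf):
    # Sparse TF-IDF vector as a dict; terms outside the vocabulary are ignored.
    counts = Counter(t for t in tokens if t in idf)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {t: (c / total) * idf[t] for t, c in counts.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, topic_docs):
    # Score the new document against every topic and return the best match.
    topics = list(topic_docs)
    token_lists = [tokenize(topic_docs[t]) for t in topics]
    idf = build_idf(token_lists)
    topic_vectors = {t: tfidf_vector(toks, idf)
                     for t, toks in zip(topics, token_lists)}
    query = tfidf_vector(tokenize(text), idf)
    scores = {t: cosine(query, vec) for t, vec in topic_vectors.items()}
    return max(scores, key=scores.get), scores


# Hypothetical topic descriptions standing in for ontology topics.
TOPIC_DOCS = {
    "sports": "كرة القدم مباراة فريق لاعب هدف",
    "economy": "اقتصاد سوق بنك استثمار تجارة",
}

label, scores = classify("فاز فريق كرة القدم في مباراة أمس", TOPIC_DOCS)
print(label, scores)
```

In this sketch each topic is a single bag of words. An ontology-based variant could, for example, build each topic vector from the terms of the topic and its sub-topics, but that detail is an assumption rather than something stated in this summary.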

Previous Work
Evaluation
Evaluation Metrics
Evaluation Process
Evaluation Results
Conclusions and Future Work
