Abstract

Voluminous, conveniently accessible textual documents, created and disseminated by modern information technology, make automated document organization increasingly important for both individuals and organizations. Many existing techniques rely on document content analysis: they classify new, unlabeled documents by measuring the overlap between the documents' important features and the representative features of each document category. However, the performance of feature-based techniques can be significantly hindered by word mismatch and ambiguity problems. As a remedy, this study takes a concept-based approach and proposes a text categorization method that incorporates a domain-specific ontology to support automated document categorization more effectively. The proposed method classifies documents according to their respective ranges of relevant concepts. We empirically evaluate our method against several prevalent benchmarks, including feature-based k-nearest neighbors (kNN) and semantic-based techniques. The results show that the proposed method is more effective than the benchmark techniques; it achieves better performance when it uses a complete concept hierarchy without considering the hierarchical relationships among concepts. The proposed method illustrates how a domain-specific ontology can be incorporated to improve document classification. Our method is computationally efficient because it produces a concept space of relatively few dimensions and does not require semantic space reconstruction as new documents arrive. Moreover, the relationships and patterns that our method generates for classifying documents are explicit and comprehensible.

