Abstract

Text classification is the process of assigning one or more class labels to a text document. When a classification problem has a large number of categories, and some of those categories have only a few training documents, the task becomes difficult: recall suffers for the categories with few training documents. To handle text classification problems with many categories, and to exploit parent-child and sibling relationships between categories in the user profile and document profile for content-based filtering, hierarchical classification is a better approach. The main issue with hierarchical classification is error propagation: an error made at an early level of the hierarchy is carried forward to all levels below it, so misclassification at early levels needs to be reduced. Term ambiguity is one possible source of classification error. The Naive Bayes classification method is widely used in text classification because it requires little time for training and testing. However, the Naive Bayes model assumes that terms are conditionally independent of each other given the class, so its performance degrades on data where terms depend on one another. In this paper, a word-level n-gram based Multinomial Naive Bayes classification method is combined with hierarchical classification to reduce the misclassification that occurs at early levels of the hierarchy and to improve content-based filtering. The proposed algorithm also suggests a way to reduce the execution time required to calculate term probabilities for the n-gram Naive Bayes model.
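
The following is a minimal sketch, not the authors' implementation, of the general idea described above: a word-level n-gram (unigram + bigram) Multinomial Naive Bayes classifier applied per level of a two-level category hierarchy. The toy corpus, the category names, and the choice of scikit-learn with ngram_range=(1, 2) are all illustrative assumptions.

    # Sketch of hierarchical classification with word-level n-gram Multinomial Naive Bayes.
    # Assumptions: scikit-learn, a toy corpus, and a two-level hierarchy (not the paper's data).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def ngram_nb():
        # Unigrams plus bigrams let frequent word pairs act as single features,
        # partially relaxing the per-term independence assumption.
        return make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())

    # Toy training data: (document, top-level category, leaf category)
    train = [
        ("central bank raises interest rates", "finance", "finance/markets"),
        ("new tax policy announced by government", "finance", "finance/policy"),
        ("team wins the final match", "sports", "sports/football"),
        ("player breaks sprint world record", "sports", "sports/athletics"),
    ]
    docs = [d for d, _, _ in train]
    top = [t for _, t, _ in train]

    # Level 1: classify into top-level categories.
    root_clf = ngram_nb().fit(docs, top)

    # Level 2: one classifier per top-level category, trained only on its documents.
    children = {}
    for parent in set(top):
        sub = [(d, leaf) for d, t, leaf in train if t == parent]
        children[parent] = ngram_nb().fit([d for d, _ in sub], [l for _, l in sub])

    def classify(doc):
        parent = root_clf.predict([doc])[0]   # a mistake here propagates to the level below
        return children[parent].predict([doc])[0]

    print(classify("bank announces new interest rate policy"))

A level-1 error sends the document to the wrong child classifier, which can no longer recover the correct leaf; this is the error propagation the abstract aims to reduce by making the early-level classifier more accurate.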
