Feature selection algorithm for hierarchical text classification using Kullback-Leibler divergence

Yao Lifang, Zhu Huan, Qin Sijun

doi:10.1109/icccbda.2017.7951950

Abstract

Text classification, a simple and effective method, is considered a key technology for processing and organizing large amounts of text data. Simple text classification can no longer meet users' growing demands, so hierarchical text classification has received extensive attention and has broad application prospects. Hierarchical feature selection is a key technology in automatic hierarchical text classification. The general approach selects features for each class in the class hierarchy individually and ignores the correlation between parent and child classes. This paper proposes a feature selection method based on Kullback-Leibler (KL) divergence: it measures the correlation between a class and its subclasses by KL divergence, calculates the correlation between each feature and each subclass by mutual information, and measures the importance of subclass features using term frequency probability, in order to select a more discriminative feature set for each parent class node. We applied the hierarchical feature selection method with SVM classifiers to a hierarchical text categorization task on two corpora. Experiments showed that the proposed algorithm was effective compared with the χ2 statistic (CHI), information gain (IG), and mutual information (MI) applied directly to hierarchical feature selection.
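To make the three ingredients named in the abstract concrete, the following is a minimal Python sketch on toy term counts. The abstract does not give the paper's scoring formula, so the way the quantities are combined here (a KL weight per subclass multiplied by term frequency probability and positive pointwise mutual information) is purely an illustrative assumption. The function names, the Laplace smoothing, and the toy data are likewise hypothetical.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def term_distribution(counts, vocab, alpha=1.0):
    """Laplace-smoothed term-frequency distribution over vocab (assumed smoothing)."""
    total = sum(counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return [(counts.get(t, 0) + alpha) / total for t in vocab]


def pointwise_mi(n_tc, n_t, n_c, n):
    """log(P(t, c) / (P(t) P(c))) estimated from term-occurrence counts."""
    if n_tc == 0:
        return 0.0
    return math.log(n_tc * n / (n_t * n_c))


def score_features(child_counts):
    """Score each term for a parent node from its subclasses' term counts.

    Hypothetical combination, NOT the paper's formula:
    score(t) = sum over subclasses c of KL(c || parent) * P(t | c) * max(MI(t, c), 0)
    """
    vocab = sorted({t for counts in child_counts.values() for t in counts})
    parent = Counter()
    for counts in child_counts.values():
        parent.update(counts)
    n = sum(parent.values())
    p_parent = term_distribution(parent, vocab)

    # How far each subclass's term distribution is from the parent's.
    weights = {
        name: kl_divergence(term_distribution(counts, vocab), p_parent)
        for name, counts in child_counts.items()
    }

    scores = {}
    for t in vocab:
        total = 0.0
        for name, counts in child_counts.items():
            n_c = sum(counts.values())
            n_tc = counts.get(t, 0)
            mi = pointwise_mi(n_tc, parent[t], n_c, n)
            tf_prob = n_tc / n_c  # term frequency probability within the subclass
            total += weights[name] * tf_prob * max(mi, 0.0)
        scores[t] = total
    return scores


# Toy parent node with two subclasses (made-up counts).
toy_children = {
    "sports": {"goal": 8, "team": 6, "vote": 1},
    "politics": {"vote": 9, "team": 2, "goal": 1},
}
scores = score_features(toy_children)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting, a term shared by both subclasses ("team") receives a lower score than a term concentrated in one subclass ("goal"), which is the kind of parent-level discrimination the abstract describes. The top-ranked terms in `ranked` would then be kept as the parent node's feature set before training a classifier such as an SVM.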

Full Text