Large-Scale Hierarchical Text Classification Based on Path Semantic Vector and Prior Information

Feng Gao, Yiping Zhong, Weiming Fu, Danfeng Zhao

doi:10.1109/cis.2009.38

Abstract

Although hierarchical text classification can be improved by using hierarchical structure information, existing methods suffer from two problems: data skew (especially in large-scale hierarchies) and error propagation. In this paper, we first define the concept of a path-based semantic vector for the representation of categories. Based on this representation and particular similarity metrics, a set of additional reliable training data can then be retrieved for data-sparse categories. This training data enhancement strategy is classifier-independent and can improve classification for categories that lack adequate training data. Second, we propose an occurrence-probability-based strategy for hierarchical text classification that can reduce error propagation efficiently. Co-occurrence probability is then introduced to correct errors that occur at higher levels of the hierarchy. Extensive experiments show that our hierarchical classification strategies perform well on the ODP dataset, even when little training data is available.
