Abstract

Term weighting schemes have been widely used in information retrieval and text categorization models. In this paper, we first investigate the limitations of several state-of-the-art term weighting schemes in the context of text categorization. Considering that category-specific terms are more useful for discriminating between categories, and that such terms tend to have lower entropy with respect to those categories, we explore the relationship between a term's discriminating power and its entropy over a set of categories. To this end, we propose two entropy-based term weighting schemes (i.e., tf.dc and tf.bdc), which measure the discriminating power of a term by its global distributional concentration over the categories of a corpus. To demonstrate the effectiveness of the proposed schemes, we compare them with seven state-of-the-art schemes on a long-text corpus and a short-text corpus. Our experimental results show that the proposed schemes outperform the state-of-the-art schemes in text categorization tasks with KNN and SVM.
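The core idea above, that a term concentrated in few categories has low entropy and therefore high discriminating power, can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation: it assumes dc is one minus the entropy of the term's frequency distribution over categories, normalized by log of the number of categories, and that bdc differs only by first dividing each category frequency by the category size so that large categories do not dominate. All function names here are illustrative.

```python
import math

def dc(tf_per_category):
    """Assumed distributional concentration: 1 minus the normalized
    entropy of the term's distribution over categories. Returns 1.0
    for a term appearing in a single category (maximally concentrated)
    and 0.0 for a perfectly uniform term."""
    total = sum(tf_per_category)
    k = len(tf_per_category)
    if total == 0 or k < 2:
        return 1.0
    entropy = -sum((f / total) * math.log(f / total)
                   for f in tf_per_category if f > 0)
    return 1.0 - entropy / math.log(k)

def bdc(tf_per_category, category_sizes):
    """Assumed balanced variant: normalize each frequency by its
    category size before measuring concentration."""
    balanced = [f / n for f, n in zip(tf_per_category, category_sizes)]
    return dc(balanced)  # dc accepts any non-negative weights

# A tf.dc weight would then be term frequency times concentration:
def tf_dc(tf_in_doc, tf_per_category):
    return tf_in_doc * dc(tf_per_category)
```

Under these assumptions, a term occurring only in one category gets dc = 1, a term spread uniformly over all categories gets dc = 0, and bdc corrects for skewed category sizes: a term whose per-document rate is the same in every category scores 0 even if its raw counts differ.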
