Feature selection with a measure of deviations from Poisson in text categorization

Hiroshi Ogura, Hiromi Amano, Masato Kondo

doi:10.1016/j.eswa.2008.08.006

Abstract

To improve the performance of automatic text classification, it is desirable to reduce the high dimensionality of the feature space. In this paper, we propose a new measure for selecting features, which estimates term importance from how far the probability distribution of each term deviates from the standard Poisson distribution. In the information retrieval literature, the deviation from Poisson has been used as a measure for weighting keywords, and this motivates us to adopt it as a measure for feature selection in text classification tasks. The proposed measure is constructed to have the same computational complexity as other standard measures used for feature selection. To test the effectiveness of our method, we conducted evaluation experiments on the Reuters-21578 corpus with support vector machine and k-NN classifiers. In the experiments, we performed binary classifications to determine whether or not each test document belongs to a given target category. Each of the top 10 categories of Reuters-21578 was used as the target category, because these categories have sufficient numbers of training and test documents. Four measures were used for feature selection: information gain (IG), the χ²-statistic, the Gini index, and the measure proposed in this work. Both the proposed measure and the Gini index proved to be better than IG and the χ²-statistic in terms of macro-averaged and micro-averaged values of F₁, especially at higher vocabulary reduction levels.
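To make the idea of "deviation from Poisson" concrete, the following is a minimal sketch of residual IDF, a keyword-weighting quantity known from the information retrieval literature. It is an illustration only and is not the measure proposed in this paper, whose definition is not given in the abstract. Under a Poisson model with rate λ = cf/N (collection frequency over number of documents), a term is expected to appear in a fraction 1 − e^(−λ) of the documents; residual IDF is the gap between the observed IDF and this Poisson-predicted IDF. The function names `residual_idf` and `select_features` and the toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def residual_idf(docs):
    """Return {term: residual IDF} for a list of tokenized documents.

    Illustrative only: this is residual IDF, not the paper's measure.
    """
    n_docs = len(docs)
    cf = Counter()  # collection frequency: total occurrences of each term
    df = Counter()  # document frequency: number of documents containing it
    for tokens in docs:
        cf.update(tokens)
        df.update(set(tokens))

    scores = {}
    for term in cf:
        lam = cf[term] / n_docs  # Poisson rate per document
        observed_idf = -math.log2(df[term] / n_docs)
        # Poisson predicts P(term occurs at least once) = 1 - exp(-lam)
        expected_idf = -math.log2(1.0 - math.exp(-lam))
        scores[term] = observed_idf - expected_idf
    return scores


def select_features(docs, k):
    """Keep the k terms that deviate most from the Poisson prediction."""
    scores = residual_idf(docs)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    # Hypothetical toy corpus: "oil" is bursty, "the" is spread evenly.
    toy_docs = [
        ["oil", "oil", "oil", "the"],
        ["the", "price"],
        ["the"],
        ["the", "market"],
    ]
    print(select_features(toy_docs, 2))
```

In this sketch a term concentrated in a few documents scores high, while a term spread evenly across documents scores at or below zero, which is the intuition behind using such a deviation to rank candidate features.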

Full Text