Probability variance CHI feature selection method for unbalanced data

Xiaowen Zhang, Bingfeng Chen

doi:10.1063/1.4992832

Abstract

Feature selection on unbalanced text data is a difficult problem. To address it, this paper analyzes how feature items are distributed within and between classes, and how documents differ, in unbalanced data sets. Using word frequency probability and document probability to measure features and documents in unbalanced data, the paper proposes a CHI feature selection method based on probabilistic variance. The method improves the traditional chi-square statistical model by introducing three factors: an intra-class word frequency probability factor, an inter-class document probability concentration factor, and an intra-class uniformity factor. Experiments demonstrate the effectiveness and feasibility of the method.
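
For orientation, the traditional chi-square (CHI) statistic that the proposed method builds on can be sketched as follows. This is a minimal illustration of the standard document-count formulation only; the abstract does not define the three correction factors, so they are not implemented here. The function names `contingency` and `chi_square` and the toy documents are assumptions made for illustration.

```python
def contingency(docs, labels, term, cls):
    """Count documents by (contains term?, belongs to cls?).

    docs   : list of token collections, one per document (hypothetical input format)
    labels : list of class labels, aligned with docs
    Returns (a, b, c, d):
      a = documents in cls that contain term
      b = documents outside cls that contain term
      c = documents in cls that do not contain term
      d = documents outside cls that do not contain term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        if label == cls:
            if has_term:
                a += 1
            else:
                c += 1
        else:
            if has_term:
                b += 1
            else:
                d += 1
    return a, b, c, d


def chi_square(a, b, c, d):
    """Traditional CHI score: N * (ad - bc)^2 / ((a+c)(b+d)(a+b)(c+d))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A margin is empty (e.g. the term never occurs), so the score is undefined;
        # returning 0.0 is a convention chosen for this sketch.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Toy unbalanced corpus: 4 documents of class "maj", 1 of class "min".
docs = [{"ball", "team"}, {"team", "win"}, {"ball"}, {"team"}, {"stock", "market"}]
labels = ["maj", "maj", "maj", "maj", "min"]
score = chi_square(*contingency(docs, labels, "stock", "min"))
```

Because the traditional score uses only document counts, it ignores how often a term occurs inside a document and how evenly it is spread within a class; those are the gaps the factors named in the abstract are intended to address.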
