Abstract

Feature selection on unbalanced text data is a difficult problem. To address it, this paper analyzes how feature terms are distributed within and between classes, and how documents differ, in unbalanced data sets. Using word frequency probability and document probability to measure features and documents in unbalanced data, the paper proposes a CHI feature selection method based on probabilistic variance. The method improves the traditional chi-square statistical model by introducing three factors: an intra-class word frequency probability factor, an inter-class document probability concentration factor, and an intra-class uniformity factor. Experiments demonstrate the effectiveness and feasibility of the method.
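For orientation, the traditional chi-square (CHI) statistic that the proposed method takes as its starting point can be sketched as follows. This is a minimal illustration of the standard 2x2 contingency-table form only; the three correction factors named in the abstract are not defined here and are therefore not implemented. The function name `chi_square` and the count names `a`, `b`, `c`, `d` are assumptions made for this example, not notation from the paper.

```python
def chi_square(a, b, c, d):
    """Standard chi-square score of a term for one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A zero marginal means the term or the class carries no
        # contrast, so the score is defined as 0 here (an assumption).
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts for a minority class of 10 documents out of 1000.
score = chi_square(a=8, b=20, c=2, d=970)
print(score)
```

Because the score depends only on document counts, it ignores how often a term occurs inside each document and how evenly it is spread within a class, which is the kind of gap the factors listed in the abstract are meant to address.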
