Abstract

This paper proposes a new approach of feature selection based on the independent measure between features for text categorization. A fundamental hypothesis that occurrence of the terms in documents is independent of each other, widely used in the probabilistic models for text categorization (TC), is discussed. However, the basic hypothesis is incomplete for independence of feature set. From the view of feature selection, a new independent measure between features is designed, by which a feature selection algorithm is given to obtain a feature subset. The selected subset is high in relevance with category and strong in independence between features, satisfies the basic hypothesis at maximum degree. Compared with other traditional feature selection method in TC (which is only taken into the relevance account), the performance of feature subset selected by our method is prior to others with experiments on the benchmark dataset of 20 Newsgroups.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call