Feature selection has been extensively applied in statistical pattern recognition as a mechanism for cleaning up the set of features used to represent data and as a way of improving classifier performance. Four schemes commonly used for feature selection are Exponential Searches, Stochastic Searches, Sequential Searches, and Best Individual Features. The most popular scheme in text categorization is Best Individual Features, as the extremely high dimensionality of text feature spaces renders the other three schemes time-prohibitive. This paper proposes five new metrics for selecting Best Individual Features for use in text categorization. Their effectiveness has been empirically tested on two well-known data collections, Reuters-21578 and 20 Newsgroups. Experimental results show that the performance of two of the five new metrics, Bayesian Rule and F-one Value, is not significantly below that of a good traditional text categorization selection metric, Document Frequency. The performance of another two of the five, Low Loss Dimensionality Reduction and Relative Frequency Difference, is equal to or better than that of conventional good feature selection metrics such as Mutual Information and Chi-square Statistic.
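To illustrate the Best Individual Features scheme the abstract refers to, the sketch below scores every term independently with one of the named baseline metrics, the Chi-square Statistic, and keeps the top-k scoring terms. The tiny corpus, the class labels, and the value of k are illustrative assumptions, not data or metrics from the paper itself.

```python
def chi_square(A, B, C, D):
    """Chi-square for a term/class 2x2 contingency table:
    A: docs in the class containing the term, B: docs outside the class containing it,
    C: docs in the class without it, D: docs outside the class without it."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def best_individual_features(docs, labels, target, k):
    """Score each vocabulary term independently and return the k best.
    This is the essence of Best Individual Features: no search over
    feature subsets, just a per-term ranking."""
    vocab = {t for d in docs for t in d}
    n_pos = sum(1 for y in labels if y == target)
    scores = {}
    for t in vocab:
        A = sum(1 for d, y in zip(docs, labels) if y == target and t in d)
        B = sum(1 for d, y in zip(docs, labels) if y != target and t in d)
        C = n_pos - A
        D = len(docs) - n_pos - B
        scores[t] = chi_square(A, B, C, D)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy corpus: each document is a set of terms.
docs = [{"ball", "goal", "team"}, {"goal", "match"},
        {"stock", "market"}, {"market", "price", "team"}]
labels = ["sport", "sport", "finance", "finance"]
print(best_individual_features(docs, labels, "sport", 2))
```

On this toy corpus the top two terms for "sport" are "goal" and "market": chi-square is symmetric, so a term perfectly indicating the absence of a class scores as highly as one indicating its presence, which is one known quirk of this family of metrics.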