Abstract

For text data, feature dimension reduction is an important step for simplifying document representation and speeding up learning algorithms. There are two main dimension reduction strategies: feature extraction and feature selection. Feature extraction creates new features to represent documents, whereas feature selection returns a subset of words as features. Comparing the two strategies, feature extraction is powerful at reducing dimensionality but loses the intuitive semantics of the documents. Feature selection offers good interpretability of text content and is therefore especially valuable for text dimension reduction, but designing a suitable measure for feature evaluation remains difficult. In this paper we present a new feature selection method called class subspace feature selection (CSFS). We use PCA feature extraction to capture lower-dimensional class subspaces, and then, based on these subspaces, choose the features most relevant to them. The feature words chosen by our method approximate the class subspace, which has lower dimensionality while retaining an intuitive semantic interpretation of the class. Experimental results on three text data sets show the effectiveness of the proposed feature selection method.
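
To make the two-stage idea concrete, the following is a minimal sketch of PCA-based class-subspace feature selection. The abstract does not give the actual relevance measure used by CSFS, so the scoring rule below (each word scored by the norm of its loadings in the class PCA bases) is an illustrative assumption, not the paper's formula; the function name and parameters are likewise hypothetical.

```python
import numpy as np

def class_subspace_feature_selection(X, y, n_components=10, n_features=100):
    """Sketch: select words that best span per-class PCA subspaces.

    X : (n_docs, n_words) term matrix; y : (n_docs,) class labels.
    NOTE: the per-word score below is an assumed proxy for "relevance
    to the class subspace", not the CSFS criterion from the paper.
    """
    n_words = X.shape[1]
    scores = np.zeros(n_words)
    for c in np.unique(y):
        Xc = X[y == c]
        Xc = Xc - Xc.mean(axis=0)                 # center the class documents
        # PCA via SVD: rows of Vt are principal axes in word space
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        basis = Vt[:n_components]                 # (k, n_words) class subspace basis
        # score each word by how strongly it participates in the subspace
        scores += np.linalg.norm(basis, axis=0)
    # keep the words that best approximate the class subspaces
    return np.argsort(scores)[::-1][:n_features]
```

In this reading, the retained words are those whose directions dominate the low-dimensional class subspaces found by PCA, which preserves interpretability (the output is still a word subset) while being guided by an extraction-style subspace.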
