A novel unsupervised feature selection method for bioinformatics data sets through feature clustering

Guangrong Li, Xiajiong Shen, Xiaohua Hu, Zhoujun Li, Xin Chen

doi:10.1109/grc.2008.4664788

https://doi.org/10.1109/grc.2008.4664788

Abstract

Many feature selection methods have been proposed, and most of them belong to the supervised learning paradigm. Recently, unsupervised feature selection has attracted considerable attention, especially in bioinformatics and text mining. So far, supervised and unsupervised feature selection methods have been studied and developed separately. A subset selected by a supervised feature selection method may not be a good one for unsupervised learning, and vice versa. In bioinformatics research, however, it is very common to perform clustering and classification iteratively on the same data sets, especially in gene expression analysis, so it is highly desirable to have a feature selection method that works well for both unsupervised and supervised learning. In this paper we propose a novel feature selection algorithm through feature clustering. Our algorithm does not need the class label information in the data set and is suitable for both supervised and unsupervised learning. It groups the features into clusters based on feature similarity, so that the features in the same cluster are similar to each other. A representative feature is then selected from each cluster, which reduces feature redundancy. Because the algorithm uses feature similarity to reduce redundancy but requires no feature search, it works well for high-dimensional data sets. We test our algorithm on several biological data sets for both clustering and classification analysis, and the results indicate that our FSFC algorithm can significantly reduce the size of the original data sets without sacrificing the quality of clustering and classification.
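
The abstract does not specify the similarity measure, the clustering procedure, or the rule for picking a representative, so the following is only a minimal sketch of the general idea (cluster the features, keep one per cluster) and not the FSFC algorithm itself. The absolute Pearson correlation, the greedy threshold grouping, the `threshold` parameter, and the function names are all assumptions made for illustration. Note that no class labels are used.

```python
import math


def abs_pearson(x, y):
    """Absolute Pearson correlation of two equal-length sequences (assumed similarity)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: treat as dissimilar to everything
    return abs(cov) / math.sqrt(var_x * var_y)


def select_by_feature_clustering(columns, threshold=0.9):
    """Greedily group similar features and keep one representative per group.

    columns: list of feature columns, each a list of sample values.
    Returns (representatives, clusters) as lists of feature indices.
    """
    n = len(columns)
    sim = [[abs_pearson(columns[i], columns[j]) for j in range(n)] for i in range(n)]
    unassigned = list(range(n))
    representatives, clusters = [], []
    while unassigned:
        seed = unassigned[0]
        # Hypothetical grouping rule: everything similar enough to the seed.
        cluster = [j for j in unassigned if j == seed or sim[seed][j] >= threshold]
        # Hypothetical representative rule: highest total similarity to the cluster.
        rep = max(cluster, key=lambda i: sum(sim[i][j] for j in cluster))
        clusters.append(cluster)
        representatives.append(rep)
        unassigned = [j for j in unassigned if j not in cluster]
    return representatives, clusters


# Toy data: features 0 and 1 are nearly collinear, and feature 3 is 2 * feature 2.
features = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.5],
    [5.0, 1.0, 4.0, 2.0, 3.0],
    [10.0, 2.0, 8.0, 4.0, 6.0],
]
reps, groups = select_by_feature_clustering(features, threshold=0.9)
print(reps, groups)
```

Because only one pass over the similarity matrix is needed and no subset search is performed, a scheme of this kind stays cheap as the number of features grows, which is the property the abstract emphasizes for high-dimensional data.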
