Abstract

Feature selection aims to remove irrelevant and redundant features from input data. For gene expression analysis, selecting important genes is essential since gene expression data typically contain a very large number of genes. However, commonly used feature selection methods are usually biased toward the highest-ranked features, and the selected features may be highly correlated with one another. To overcome these problems, we propose an informative feature clustering and selection method to select informative and diverse genes from gene expression data. The method consists of two steps. In the first step, a feature clustering (FC) method is designed to group all genes into several gene clusters. In FC, a set of feature weights is computed to reflect the importance of each gene, and the genes within each cluster are sorted by these weights. In the second step, we propose a stratified feature selection (SFS) method that selects genes from the different gene clusters and combines them to form the final feature set. Experiments on several gene expression datasets demonstrate the superiority of the proposed method over six popular feature selection methods.
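To make the two-step pipeline concrete, here is a minimal sketch; it is not the authors' implementation. Plain k-means on the gene (column) profiles stands in for FC, mutual information scores stand in for the paper's feature weights, and a per-cluster top-k pick stands in for SFS. All function and parameter names below are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import mutual_info_classif


def cluster_and_select(X, y, n_clusters=10, genes_per_cluster=5, random_state=0):
    """Illustrative two-step selection: cluster the genes, then keep the
    top-weighted genes from every cluster (a stand-in for FC + SFS)."""
    # Step 1 (FC stand-in): cluster genes by their expression profiles;
    # each column of X is a gene, so the transposed matrix is clustered.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X.T)

    # Feature weights: mutual information with the class labels is used
    # here only as an example importance score, not the paper's weighting.
    weights = mutual_info_classif(X, y, random_state=random_state)

    # Step 2 (SFS stand-in): inside each gene cluster, keep the genes with
    # the largest weights, then merge the per-cluster selections.
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        order = np.argsort(weights[members])[::-1]
        selected.extend(members[order[:genes_per_cluster]].tolist())
    return sorted(selected)
```

Calling `cluster_and_select(X, y)` on an expression matrix `X` (samples × genes) with labels `y` returns the indices of the selected genes, so `X[:, selected]` gives the reduced dataset.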

Highlights

  • Feature selection aims to remove irrelevant and redundant features from high-dimensional data

  • In these experiments, we compared stratified feature selection (SFS) with six supervised feature selection methods that are widely used as baselines, including MulInf [38], MRMR [29], Relief-F [23], [26], RFS [28], SVM-RFE with correlation bias reduction (CBR) [36], and UGL [24], to verify the effectiveness of SFS

  • We propose a Feature Clustering (FC) method to cluster genes into a series of gene clusters


Summary

INTRODUCTION

Feature selection aims to remove irrelevant and redundant features from high-dimensional data. We propose an informative feature clustering and selection method to select important genes from high-dimensional gene expression data. For analyzing high-dimensional document data, Ye et al. [37] proposed a co-clustering method that computes the weight of each feature from the mutual information between documents and features. Inspired by the weighting co-clustering methods in [6], [9], we observe that highly correlated features fall into the same co-clusters while features in different clusters have low correlations. To remove redundant features, we therefore cluster genes into several disjoint clusters, estimate the importance of each gene, and formulate an objective function for this clustering. Supposing that FC converges in r iterations, its time cost is O(rnmkl), which is the same as the time cost of k-means.
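The redundancy argument above can be checked numerically. The helper below is only an illustration, under the assumption that `X` is a samples × genes matrix and `labels` holds one cluster index per gene; it compares the mean absolute correlation of gene pairs inside the same cluster against pairs drawn from different clusters.

```python
import numpy as np


def redundancy_check(X, labels):
    """Compare mean |correlation| of gene pairs inside the same cluster
    with pairs drawn from different clusters (illustrative only)."""
    labels = np.asarray(labels)
    corr = np.abs(np.corrcoef(X.T))              # genes x genes |correlation|
    same = labels[:, None] == labels[None, :]    # True where two genes share a cluster
    off_diag = ~np.eye(len(labels), dtype=bool)  # ignore self-correlations
    within = corr[same & off_diag].mean()
    between = corr[~same].mean()
    return within, between
```

If the clustering behaves as described, the within-cluster value should be clearly larger than the between-cluster one, which is what allows a stratified selection to keep only a few genes per cluster without retaining redundant copies.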

INFORMATIVE FEATURE CLUSTERING AND SELECTION METHOD
FEATURE CLUSTERING
STRATIFIED FEATURE SELECTION
CONCLUSION