PCGEN: A Practical Approach to Projected Clustering and its Application to Gene Expression Data

Mohamed Bouguessa, Shengrui Wang

doi:10.1109/cidm.2007.368939

Abstract

Clustering samples in gene expression data has always been a major challenge because of the high dimensionality of the input space (typically in the tens of thousands) and the small number of samples (typically fewer than a hundred). Moreover, clusters may be hidden in subspaces of very low dimensionality. Most existing clustering algorithms become substantially inefficient when the required similarity measure must be computed between data points in the full-dimensional space. These challenges motivate us to propose a new and efficient partitional, distance-based projected clustering algorithm for clustering samples in gene expression data. Our algorithm can detect projected clusters of extremely low dimensionality embedded in a high-dimensional space, and it avoids computing distances in the full-dimensional space. An empirical study on public microarray datasets demonstrates the suitability of our proposal.
