Semi-supervised ensemble clustering algorithm for high-dimensional genomic data

P Krishnakumari, K Vivekanandan

doi:10.1504/ijrapidm.2009.029384

Abstract

Clustering high-dimensional data is a recurring and difficult problem in many domains, such as computational biology, largely because of the curse of dimensionality. This paper presents an efficient, scalable clustering algorithm for high-dimensional data that combines linear discriminant analysis (LDA), applied after PCA feature extraction, with the K-means algorithm to select the most discriminative subspace. K-means clustering first generates class labels, and LDA then uses those labels for subspace selection towards the highest variance. The algorithm is designed to reduce the sum of squared errors within the partitions as much as possible while keeping the partitions as far apart as possible. The clustering process is thus integrated with LDA-based subspace selection, so the data are clustered and the feature subspaces are selected simultaneously. Finally, the resulting clustering instances are aggregated by agglomerative clustering to produce the final clusters. For medical data, all dimensions are necessary, and the proposed method covers all of them efficiently. Experiments on real datasets show that the proposed method is more accurate than existing methods for clustering high-dimensional genomic data.
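The pipeline described in the abstract (PCA, then alternating K-means and LDA, then agglomerative aggregation of several clustering instances) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and SciPy, the function names `lda_kmeans` and `ensemble_clusters` and all parameter defaults are hypothetical, the ensemble members are produced simply by varying the K-means seed, and the aggregation step uses a co-association matrix with average linkage, which is one common choice that the abstract does not specify.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import adjusted_rand_score


def lda_kmeans(X, n_clusters, n_components=10, n_iter=10, seed=0):
    """One clustering instance: alternate K-means labelling and LDA projection."""
    # PCA first, so LDA works on a small number of features even when the
    # data have far more dimensions (e.g. genes) than samples.
    n_components = min(n_components, X.shape[0] - 1, X.shape[1])
    Z = PCA(n_components=n_components, random_state=seed).fit_transform(X)

    # Initial class labels come from K-means in the PCA space.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(Z)

    for _ in range(n_iter):
        # LDA uses the current cluster labels to pick a discriminative subspace.
        Y = LinearDiscriminantAnalysis().fit_transform(Z, labels)
        # Re-cluster inside that subspace.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        new_labels = km.fit_predict(Y)
        # Stop once the partition no longer changes (up to label renaming).
        converged = adjusted_rand_score(labels, new_labels) == 1.0
        labels = new_labels
        if converged:
            break
    return labels


def ensemble_clusters(X, n_clusters, n_runs=5, **kwargs):
    """Aggregate several clustering instances with agglomerative clustering."""
    n = X.shape[0]
    coassoc = np.zeros((n, n))
    for seed in range(n_runs):
        labels = lda_kmeans(X, n_clusters, seed=seed, **kwargs)
        # Count how often each pair of samples lands in the same cluster.
        coassoc += labels[:, None] == labels[None, :]

    # Turn co-association frequencies into distances in [0, 1].
    dist = 1.0 - coassoc / n_runs
    np.fill_diagonal(dist, 0.0)

    # Average-linkage agglomerative clustering, cut into n_clusters groups.
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Calling `ensemble_clusters(X, n_clusters=3)` on a samples-by-features array returns one integer cluster label per sample. Other ways of diversifying the ensemble members, such as random feature subsets or different numbers of PCA components, would fit the same structure.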

Full Text