Determining distinct clusters in gene expression data using similarity in principal component subspaces

Sudhakar Jonnalagadda, Rajagopalan Srinivasan

doi:10.1007/s12572-012-0055-1

Abstract

Clustering is routinely used in gene expression data analysis to mine groups of co-expressed genes. Commonly used clustering algorithms require the user to specify the number of clusters a priori. We have developed a method that identifies, from a set of candidate partitions, the one with the maximal number of distinct clusters. Principal component analysis is used to characterize each cluster by its dominant eigenvectors, which describe the correlation between the constituent genes. Similarity between each pair of clusters is measured as the angle between their principal component subspaces. A cluster is deemed to be ‘distinct’ if it shows low similarity to all other clusters in that partition. The method assigns each candidate partition a cumulative measure of the distinctness of all its clusters, called the Net Principal Subspace Information (NEPSI) Index. The candidate partition with the highest NEPSI index value has the maximal number of distinct clusters and is selected as the ‘best’. We illustrate the efficacy of the proposed method using two gene expression datasets and two different clustering algorithms: k-means and model-based clustering. A comparison of the results with those from the Bayesian Information Criterion is also given.
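
The pairwise comparison of cluster subspaces described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `principal_subspace` and `subspace_similarity`, the choice of `k`, and the use of the mean squared cosine of the principal angles as the similarity score are all assumptions made for illustration. The abstract does not state the formula for the NEPSI index, so the sketch stops at the similarity between two clusters.

```python
import numpy as np


def principal_subspace(X, k):
    """Orthonormal basis (n_conditions x k) of the top-k principal directions.

    X holds one cluster: rows are genes, columns are conditions.
    The number of retained components k is an assumed, user-chosen value.
    """
    Xc = X - X.mean(axis=0)                      # center each condition
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[:k].T                              # columns span the subspace


def subspace_similarity(A, B):
    """Mean squared cosine of the principal angles between span(A) and span(B).

    A and B must have orthonormal columns. The score is 1 for identical
    subspaces and 0 for mutually orthogonal ones. This particular score is
    an illustrative assumption, not the measure defined in the paper.
    """
    cosines = np.linalg.svd(A.T @ B, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)         # guard against round-off
    return float(np.mean(cosines ** 2))


# Toy clusters over 4 conditions: genes in X1 and X3 vary along the first
# condition, genes in X2 vary along the second.
t = np.linspace(-1.0, 1.0, 20)
X1 = np.outer(t, [1.0, 0.0, 0.0, 0.0])
X2 = np.outer(t, [0.0, 1.0, 0.0, 0.0])
X3 = np.outer(2.0 * t, [1.0, 0.0, 0.0, 0.0])

S1, S2, S3 = (principal_subspace(X, k=1) for X in (X1, X2, X3))
sim_12 = subspace_similarity(S1, S2)   # clusters with unrelated patterns
sim_13 = subspace_similarity(S1, S3)   # clusters sharing one pattern
```

In this toy setting, a cluster such as `X2` would count as distinct from `X1` because their subspaces barely overlap, whereas `X1` and `X3` share a subspace and would be candidates for a single cluster.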
