Semi-supervised consensus clustering for gene expression data analysis.

Yunli Wang, Youlian Pan

doi:10.1186/1756-0381-7-7

Abstract

Background: Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis, but they cannot cope with the noise and high dimensionality of microarray gene expression data. Consensus clustering appears to improve the robustness and quality of clustering results. Incorporating prior knowledge into the clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and domain knowledge.

Methods: We proposed semi-supervised consensus clustering (SSCC), which integrates consensus clustering with semi-supervised clustering for analyzing gene expression data. We investigated the roles of consensus clustering and prior knowledge in improving the quality of clustering. SSCC was compared with one semi-supervised clustering algorithm, one consensus clustering algorithm, and k-means. Experiments on eight gene expression datasets were performed using h-fold cross-validation.

Results: Using prior knowledge improved the clustering quality by reducing the impact of noise and high dimensionality in microarray data. Integrating consensus clustering with semi-supervised clustering improved performance compared with using either consensus clustering or semi-supervised clustering alone. Our SSCC method outperformed the other methods tested in this paper.

Highlights

Simple clustering methods such as hierarchical clustering and k-means are widely used for gene expression data analysis, but they cannot cope with the noise and high dimensionality of microarray gene expression data.
The performance of semi-supervised consensus clustering (SSCC) was influenced by the amount of prior knowledge, the consensus function, and the base clustering.
SSCC, semi-supervised spectral clustering (SSC), link-based cluster ensemble (LCE), and k-means were compared using one-way ANOVA with Bonferroni correction (p < 0.05) on normalized mutual information (NMI) and adjusted Rand index (ARI) (Table 3 and Additional file 1).
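The two evaluation measures named above, NMI and ARI, both score how well a predicted partition agrees with a reference labelling. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the geometric-mean normalization of NMI is an assumption, since this summary does not state which NMI variant the authors used.

```python
from collections import Counter
from math import comb, log, sqrt


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two flat partitions given as equal-length label lists."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs placed together in both partitions, in the reference, and in the prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case: both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_information(labels_true, labels_pred):
    """NMI normalized by the geometric mean of the two entropies (assumed variant)."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        (c / n) * log(c * n / (count_true[a] * count_pred[b]))
        for (a, b), c in joint.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * log(c / n) for c in count_pred.values())
    if h_true == 0 or h_pred == 0:  # at least one partition has a single cluster
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / sqrt(h_true * h_pred)


# Toy example: the same grouping under different label names scores as perfect agreement.
reference = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]
print(adjusted_rand_index(reference, relabelled))
print(normalized_mutual_information(reference, relabelled))
```

Both measures ignore the label names themselves and depend only on which instances are grouped together, which is why they suit comparisons between clustering algorithms whose cluster numbering is arbitrary.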

Summary

Introduction

Simple clustering methods such as agglomerative hierarchical clustering and k-means are widely used for gene expression data analysis, but they cannot cope with the noise and high dimensionality of microarray gene expression data. Incorporating prior knowledge into the clustering process (semi-supervised clustering) has been shown to improve the consistency between the data partitioning and domain knowledge.

Consensus clustering proceeds in two steps. The first step produces a set of clustering solutions, the cluster ensemble. The second step takes the cluster ensemble as input, combines the solutions through a consensus function, and outputs the final partitioning, known as the final clustering. Some consensus clustering methods use a pairwise similarity matrix of instances to combine multiple clustering solutions [1,2]; others use associations between instances and clusters in the consensus matrix [4]. These consensus clustering algorithms usually outperform single clustering algorithms on gene expression datasets [1,2,3,4].
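To make the idea of a pairwise similarity matrix of instances concrete, the sketch below builds a co-association matrix from a small hand-made cluster ensemble and derives a final clustering from it. It is an illustration only, not the SSCC method or the algorithms cited in [1,2]: the function names, the toy ensemble, and the consensus function (linking instances whose co-association exceeds a threshold and taking connected components) are all assumptions chosen for brevity.

```python
def co_association_matrix(ensemble):
    """Fraction of base clusterings that place each pair of instances together."""
    n = len(ensemble[0])
    counts = [[0] * n for _ in range(n)]
    for labels in ensemble:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(ensemble) for c in row] for row in counts]


def consensus_partition(matrix, threshold=0.5):
    """Hypothetical consensus function: connected components of the graph
    that links two instances when their co-association exceeds the threshold."""
    n = len(matrix)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Toy cluster ensemble: three base clusterings of six instances.
# Label names differ between solutions, and two solutions each misplace one instance.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 2],
]
matrix = co_association_matrix(ensemble)
print(consensus_partition(matrix))  # [0, 0, 0, 1, 1, 1]
```

Because the matrix records only whether two instances share a cluster, the arbitrary label names used by each base clustering never need to be aligned, and a disagreement in a single base solution is outvoted by the others.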

Results

Discussion

Conclusion