Clustering of gene expression data: performance and similarity analysis

Longde Yin, Chun-Hsi Huang, Jun Ni

doi:10.1186/1471-2105-7-s4-s19

Open Access

https://doi.org/10.1186/1471-2105-7-s4-s19

Journal: BMC Bioinformatics	Publication Date: Dec 1, 2006
Citations: 47	License type: CC BY 2.0

Affiliation: University of Connecticut, University of Iowa

Abstract

Background: DNA Microarray technology is an innovative methodology in experimental molecular biology, which has produced huge amounts of valuable data in the profile of gene expression. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The evaluation of feasible and applicable clustering algorithms is becoming an important issue in today's bioinformatics research.

Results: In this paper we first experimentally study three major clustering algorithms: Hierarchical Clustering (HC), Self-Organizing Map (SOM), and Self Organizing Tree Algorithm (SOTA), using yeast (Saccharomyces cerevisiae) gene expression data, and compare their performance. We then introduce Cluster Diff, a new data mining tool, to conduct the similarity analysis of clusters generated by different algorithms. The performance study shows that SOTA is more efficient than SOM, while HC is the least efficient. The results of the similarity analysis show that, given a target cluster, Cluster Diff can efficiently determine the closest match from a set of clusters. Therefore, it is an effective approach for evaluating different clustering algorithms.

Conclusion: HC methods allow a visual, convenient representation of genes. However, they are neither robust nor efficient. The SOM is more robust against noise. A disadvantage of SOM is that the number of clusters has to be fixed beforehand. The SOTA combines the advantages of both hierarchical and SOM clustering. It allows a visual representation of the clusters and their structure and is not sensitive to noise. The SOTA is also more flexible than the other two clustering methods. By using our data mining tool, Cluster Diff, it is possible to analyze the similarity of clusters generated by different algorithms and thereby enable comparisons of different clustering methods.
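
The closest-match idea attributed to Cluster Diff above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how Cluster Diff measures similarity, so the Jaccard index used here is an assumption, and the function names, cluster labels, and gene identifiers are all hypothetical.

```python
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty.

    Assumption: the source does not specify a similarity measure.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_match(target: Set[str], candidates: Dict[str, Set[str]]) -> Tuple[str, float]:
    """Return the label and score of the candidate cluster most similar to target."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    best = max(candidates, key=lambda label: jaccard(target, candidates[label]))
    return best, jaccard(target, candidates[best])


# Hypothetical data: one cluster from algorithm A, three clusters from algorithm B.
target_cluster = {"g1", "g2", "g3", "g4"}
other_clusters = {
    "c1": {"g1", "g2", "g3"},
    "c2": {"g4", "g5"},
    "c3": {"g6"},
}
label, score = closest_match(target_cluster, other_clusters)
print(label, score)
```

Set overlap is only one plausible choice; a measure based on expression-profile distance between cluster centroids would fit the same interface.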

Highlights

DNA Microarray technology is an innovative methodology in experimental molecular biology, which has produced huge amounts of valuable data in the profile of gene expression
For a large number of genes (>1000), Self Organizing Tree Algorithm (SOTA) is faster than Hierarchical Clustering (HC)
The computation time of SOTA and Self-Organizing Map (SOM) is proportional to the sample size, and computation using SOTA is faster than using SOM

Summary

Introduction

DNA Microarray technology is an innovative methodology in experimental molecular biology, which has produced huge amounts of valuable data in the profile of gene expression. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The technology permits the analysis of gene expression, DNA sequence variation, protein levels, tissues, cells and other chemicals in a massive format [1,2]. Several clustering methods (algorithms) have been proposed for the analysis of gene expression data, such as Hierarchical Clustering (HC) [3], self-organizing maps (SOM) [4], and k-means approaches [5]. How to determine the "correct" number of clusters and how to choose the "best" algorithm are not yet clear [6].

Objectives

Methods

Results

Conclusion