Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes.

Susmita Datta, Somnath Datta

doi:10.1186/1471-2105-7-397

Abstract

Background: A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see if the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used the standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as a measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing databases of gene ontologies (GO) for the annotated genes of the relevant species.

Results: In this paper, we introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters. The first measure is a biological homogeneity index (BHI). As the name suggests, it is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and also for comparing the performance of a number of competing clustering algorithms applied to the same data set. The second performance measure is called a biological stability index (BSI). For a given clustering algorithm and an expression data set, it measures the consistency of the clustering algorithm's ability to produce biologically meaningful clusters when applied repeatedly to similar data sets. A good clustering algorithm should have high BHI and moderate to high BSI. We evaluated the performance of ten well known clustering algorithms on two gene expression data sets and identified the optimal algorithm in each case. The first data set deals with SAGE profiles of differentially expressed tags between normal and ductal carcinoma in situ samples of breast cancer patients. The second data set contains the expression profiles over time of positively expressed genes (ORFs) during sporulation of budding yeast. Two separate choices of the functional classes were used for this data set and the results were compared for consistency.

Conclusion: Functional information of annotated genes available from various GO databases mined using ontology tools can be used to systematically judge the results of an unsupervised clustering algorithm as applied to a gene expression data set in clustering genes. This information could be used to select the right algorithm from a class of clustering algorithms for the given data set.
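The abstract describes the BHI only in words, so the following is a minimal sketch under a stated assumption: that homogeneity is scored as the proportion of annotated gene pairs within a cluster that share at least one functional class, averaged over clusters. The function name, the input layout (dictionaries keyed by gene), and the handling of clusters with fewer than two annotated genes are illustrative choices, not taken from the text.

```python
from itertools import combinations


def biological_homogeneity_index(clusters, annotations):
    """Average, over clusters, of the share of annotated gene pairs
    that have at least one functional class in common.

    clusters:    dict mapping gene -> cluster label
    annotations: dict mapping gene -> set of functional classes
                 (genes missing here are treated as unannotated)
    """
    # Collect the annotated genes of every cluster.
    annotated_by_cluster = {}
    for gene, label in clusters.items():
        members = annotated_by_cluster.setdefault(label, [])
        if annotations.get(gene):
            members.append(gene)

    scores = []
    for genes in annotated_by_cluster.values():
        pairs = list(combinations(genes, 2))
        if not pairs:
            # Assumption: a cluster with fewer than two annotated
            # genes contributes zero homogeneity.
            scores.append(0.0)
            continue
        shared = sum(1 for x, y in pairs if annotations[x] & annotations[y])
        scores.append(shared / len(pairs))
    return sum(scores) / len(scores)


# Toy data (hypothetical gene names and classes).
toy_clusters = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
toy_annotations = {
    "g1": {"A"},
    "g2": {"A", "B"},
    "g3": {"B"},
    "g4": {"C"},
    "g5": {"C"},
}
# Cluster 0: 2 of 3 pairs share a class; cluster 1: 1 of 1 pair does.
bhi = biological_homogeneity_index(toy_clusters, toy_annotations)
print(round(bhi, 4))
```

Under this reading, the same function can be applied to the output of several competing algorithms on one data set, and the algorithm with the larger value would be judged to produce the more biologically homogeneous clusters.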

Highlights

A cluster analysis is the most commonly performed procedure on a set of gene expression profiles
We introduce two performance measures for evaluating the results of a clustering algorithm in its ability to produce biologically meaningful clusters
The first measure is a biological homogeneity index (BHI). It is a measure of how biologically homogeneous the clusters are. This can be used to quantify the performance of a given clustering algorithm such as UPGMA in grouping genes for a particular data set and for comparing the performances of a number of competing clustering algorithms applied to the same data set

Summary

Introduction

A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. We quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Clustering of genes on the basis of expression profiles is a frequently, if not always, performed operation in analyzing the results of a microarray or SAGE study. It is often taken as a first step in understanding how a class of genes act in concert during a biological process. Although the hierarchical clustering method UPGMA [4] is used most often with microarray data sets (partly due to its early integration into existing software), the following algorithms are generally considered to be solid performers in the clustering world and are freely available through various R [5] libraries: a partition method called K-means [6], a divisive clustering method Diana [7], a fuzzy logic based method Fanny [7], the neural network based methods SOM (self-organizing maps, [8]) and SOTA (self-organizing tree algorithm, [9]), and a normal mixture model based clustering method [10].
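As an illustration of the UPGMA setup mentioned in the abstract (average linkage with one minus Pearson's correlation as the dissimilarity), here is a small pure-Python sketch. It is not the R implementation the authors refer to; the function names, the dictionary input, and the stopping rule (merge until k clusters remain) are assumptions made for the example, and profiles are assumed to be non-constant so that the correlation is defined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def upgma_clusters(profiles, k):
    """Agglomerate genes by average linkage until k clusters remain.

    profiles: dict mapping gene -> list of expression values
    Returns a list of clusters, each a list of gene names.
    """
    genes = list(profiles)
    # Pairwise dissimilarity: one minus Pearson's correlation.
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
    }

    def average_distance(c1, c2):
        # UPGMA: mean of all between-cluster pairwise dissimilarities.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [[g] for g in genes]
    while len(clusters) > k:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy profiles (hypothetical): g1/g2 rise over time, g3/g4 fall.
toy_profiles = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
}
result = upgma_clusters(toy_profiles, k=2)
print(sorted(sorted(c) for c in result))
```

The brute-force search for the closest pair keeps the sketch short; it would be too slow for real expression data sets, where an established library routine would be the practical choice.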



Journal: BMC Bioinformatics
Publication Date: Aug 31, 2006
Citations: 192
License type: CC BY 2.0
