Abstract

BackgroundCluster analysis is an integral part of high dimensional data analysis. In the context of large scale gene expression data, a filtered set of genes are grouped together according to their expression profiles using one of numerous clustering algorithms that exist in the statistics and machine learning literature. A closely related problem is that of selecting a clustering algorithm that is "optimal" in some sense from a rather impressive list of clustering algorithms that currently exist.ResultsIn this paper, we propose two validation measures each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving a SAGE data of breast cancer patients and the other involving a time course cDNA microarray data on yeast. Six well known clustering algorithms UPGMA, K-Means, Diana, Fanny, Model-Based and SOM were evaluated.ConclusionNo single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in the selection of an optimal algorithm, for a given data set, from a collection of available clustering algorithms.

Highlights

  • Cluster analysis is an integral part of high dimensional data analysis

  • First we consider the expression profiles of 258 significant genes based on their 11 dimensional expression profiles over four normal and seven ductal carcinoma in situ (DCIS) samples [12]

  • Overall, UPGMA appears to be a good performer for this data set as judged by two very different validation measures

Read more

Summary

Introduction

Cluster analysis is an integral part of high dimensional data analysis. In the context of large scale gene expression data, a filtered set of genes are grouped together according to their expression profiles using one of numerous clustering algorithms that exist in the statistics and machine learning literature. Cluster analysis is an exploratory technique that might reveal classes or groups of genes that act in consort during a biological process. A distance or dissimilarity is calculated between the expression vectors of each pair of genes. A statistical clustering algorithm is employed which places a pair of genes in the same cluster if their expression profiles are similar as judged by the distance measure employed. The resulting grouping may be quite varied (see, e.g., [1]; [2,3])

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.