Clustering Algorithms: On Learning, Validation, Performance, and Applications to Genomics

Lori Dalton, Marcel Brun, Virginia Ballarin

doi:10.2174/138920209789177601

Open Access

https://doi.org/10.2174/138920209789177601

Journal: Current Genomics
Publication Date: Sep 1, 2009
Citations: 107
License type: CC BY

Affiliation: Texas A&M University

Abstract

The development of microarray technology has enabled scientists to measure the expression of thousands of genes simultaneously, resulting in a surge of interest in several disciplines throughout biology and medicine. While data clustering has been used for decades in image processing and pattern recognition, in recent years it has joined this wave of activity as a popular technique to analyze microarrays. To illustrate its application to genomics, clustering applied to genes from a set of microarray data groups together those genes whose expression levels exhibit similar behavior throughout the samples, and when applied to samples it offers the potential to discriminate pathologies based on their differential patterns of gene expression. Although clustering has now been used for many years in the context of gene expression microarrays, it has remained highly problematic. Choosing a clustering algorithm and validation index is not trivial, particularly when applying them to high-throughput biological or medical data. Factors to consider when choosing an algorithm include the nature of the application, the characteristics of the objects to be analyzed, the expected number and shape of the clusters, and the complexity of the problem versus the computational power available. In some cases a very simple algorithm may be appropriate to tackle a problem, but many situations may require a more complex and powerful algorithm better suited for the job at hand. In this paper, we cover the theoretical aspects of clustering, including error and learning, followed by an overview of popular clustering algorithms and classical validation indices. We also discuss the relative performance of these algorithms and indices and conclude with examples of the application of clustering to computational biology.

Highlights

Microarray technology has made available an incredible amount of gene expression data, driving research in several areas including the molecular basis of disease, drug discovery, neurobiology, and others
Microarray data is collected with the goal of either discovering genes associated with some event, predicting outcomes based on gene expression, or discovering sub-classes of diseases
While clustering has been used for decades in image processing and pattern recognition [1,2,3], in recent years it has become a popular technique in genomic studies for extracting this kind of valuable information from massive sets of gene expression data

Summary

Introduction

Microarray technology has made available an incredible amount of gene expression data, driving research in several areas including the molecular basis of disease, drug discovery, neurobiology, and others. Microarray data is collected with the goal of either discovering genes associated with some event, predicting outcomes based on gene expression, or discovering sub-classes of diseases. Clustering applied to genes from microarray data groups together those whose expression levels exhibit similar behavior across the samples. In this context, similarity is taken to indicate possible co-regulation between the genes, but may reveal other processes that relate their expression. The application of clustering to the first goal listed above is founded on the concept of “guilt by association”, where genes with similar expression across samples are assumed to share some underlying mechanism.
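As a rough illustration of grouping genes by similar expression behavior, the following Python sketch links genes whose profiles are strongly correlated. It is only a minimal example under stated assumptions, not a method taken from the paper: the Pearson correlation, the 0.9 threshold, the single-linkage style grouping, and the toy gene names and values are all hypothetical choices made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def cluster_genes(expression, threshold=0.9):
    """Group genes connected by a chain of correlations >= threshold.

    This corresponds to cutting a single-linkage hierarchy at the given
    similarity level (an assumed choice, not the paper's algorithm).
    """
    genes = list(expression)
    parent = {g: g for g in genes}  # union-find forest over genes

    def find(g):
        # Follow parent links to the cluster representative.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(expression[g], expression[h]) >= threshold:
                parent[find(g)] = find(h)  # merge the two clusters

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data: one expression profile per gene, one value per sample.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.1],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 3.9, 2.0],
}
clusters = cluster_genes(expression)
print(clusters)
```

On this toy data, the two genes whose expression rises across samples fall into one group and the two whose expression falls into another. Under the guilt-by-association reading, an uncharacterized gene in a group would be a candidate for sharing a mechanism with its better-known neighbors, although correlation alone does not establish co-regulation.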
