On the selection of appropriate distances for gene expression data clustering.

Pablo A Jaskowiak, Ricardo J G B Campello, Ivan G Costa

doi:10.1186/1471-2105-15-s2-s2

Open Access

Abstract

Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. Across the different settings of microarray experiments, clustering has proven to be a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. To obtain useful clustering results, however, the parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, choosing the distance to be employed between data objects is probably one of the most difficult decisions.

Results and conclusions: We analyze how different distances and clustering methods interact regarding their ability to cluster gene expression (i.e., microarray) data. We study 15 distances along with four common clustering methods from the literature on a total of 52 gene expression microarray datasets. Distances are evaluated in a number of different scenarios, including clustering of cancer tissues and clustering of genes from short time-series expression data, the two main clustering applications in gene expression. Our results support that the selection of an appropriate distance depends on the scenario at hand. Moreover, in each scenario, given the very same clustering method, significant differences in quality may arise from the selection of distinct distance measures. In fact, the selection of an appropriate distance measure can make the difference between meaningful and poor clustering outcomes, even for a suitable clustering method.

Highlights

Clustering is crucial for gene expression data analysis
We include in our analysis four “traditional” proximity measures, i.e., Cosine similarity adapted as a distance (COS), Euclidean distance (EUC), Manhattan distance (MAN) and Supremum distance (SUP), the last three being special cases of the Minkowski distance
Besides the comparison of the distances themselves, it is quite interesting to observe that k-medoids does not provide, in real applications, significant differences when compared to hierarchical methods
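The four traditional measures named in the highlights can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the helper `nearest_profile`, and the toy profiles are hypothetical, and the cosine distance is assumed to be one minus the cosine similarity, which is one common way of adapting it. The toy example shows how the nearest neighbor of a profile can change with the distance, even though the data stay the same.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between two equal-length profiles."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def manhattan(x, y):
    """MAN: Minkowski distance with p = 1."""
    return minkowski(x, y, 1)


def euclidean(x, y):
    """EUC: Minkowski distance with p = 2."""
    return minkowski(x, y, 2)


def supremum(x, y):
    """SUP: limit of the Minkowski distance as p grows, i.e. the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """COS: assumed here to be 1 - cosine similarity (both profiles must be non-zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def nearest_profile(query, candidates, dist):
    """Hypothetical helper: name of the candidate profile closest to `query`."""
    return min(candidates, key=lambda name: dist(query, candidates[name]))


# Toy expression profiles (made-up values, for illustration only).
a = [1.0, 2.0, 3.0]
candidates = {
    "b": [2.0, 4.0, 6.0],  # same shape as a, twice the magnitude
    "c": [1.0, 2.0, 4.0],  # close in magnitude, slightly different shape
}

for label, dist in [("MAN", manhattan), ("EUC", euclidean),
                    ("SUP", supremum), ("COS", cosine_distance)]:
    print(label, "nearest to a:", nearest_profile(a, candidates, dist))
```

In this toy case the three Minkowski distances pick `c` as the nearest profile, while the cosine distance picks `b`, because it ignores magnitude and compares only the direction of the profiles.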

Summary

Introduction

Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers gain insights and formulate new hypotheses about biological data from microarrays. A single microarray is capable of determining expression levels for virtually all the genes of a particular biological sample of interest. Clustering is frequently used because its unsupervised nature allows new hypotheses to be formulated from gene expression data. The first of its two main application scenarios arises when biological samples are clustered together. In this scenario the main objective is to detect previously unknown clusters of biological samples, which are usually associated with unknown types of cancer [4]. Once cancer signatures are identified on a genomic level, specific drugs can be developed, improving treatment efficacy while reducing side effects.

Methods

Results

Discussion

Conclusion