Fast approximate hierarchical clustering using similarity heuristics

Meelis Kull, Jaak Vilo

doi:10.1186/1756-0381-1-9

Abstract

Background: Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications. Standard AHC methods require that all pairwise distances between data objects be known. With ever-increasing data sizes, this quadratic complexity poses problems that cannot be overcome by simply waiting for faster computers.

Results: We propose an approximate AHC algorithm, HappieClust, which can output a biologically meaningful clustering of a large dataset more than an order of magnitude faster than full AHC algorithms. The key to the algorithm is to limit the number of calculated pairwise distances to a carefully chosen subset of all possible distances. We choose distances using a similarity heuristic based on a small set of pivot objects. The heuristic efficiently finds pairs of similar objects, and these help to mimic the greedy choices of full AHC. The quality of approximate AHC compared to full AHC is studied with three measures. The first measure evaluates the global quality of the achieved clustering, while the second compares biological relevance using enrichment of biological functions in every subtree of the clusterings. The third measure studies how well the contents of subtrees are conserved between the clusterings.

Conclusion: The HappieClust algorithm is well suited for large-scale gene expression visualization and analysis, both on personal computers and in public online web applications. The software is available from http://www.quretec.com/HappieClust
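The pivot idea mentioned in the abstract can be illustrated with a small sketch. This is not the HappieClust implementation; it only shows one way a pivot-based filter can work, assuming a metric distance (Euclidean here). The function name `candidate_pairs` and the parameters `n_pivots`, `eps`, and `seed` are hypothetical. By the triangle inequality, two objects that are close to each other must have similar distances to any pivot, so comparing pivot distances can discard most dissimilar pairs without computing their true distance.

```python
import random
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def candidate_pairs(points, n_pivots=2, eps=1.0, seed=0):
    """Return index pairs whose distances to every pivot differ by at most eps.

    Illustrative sketch only: names and parameters are assumptions,
    not part of the published software.
    """
    rng = random.Random(seed)
    pivots = rng.sample(points, n_pivots)
    # Only n * n_pivots true distances are computed here,
    # instead of all n * (n - 1) / 2 pairwise distances.
    sig = [[dist(x, p) for p in pivots] for x in points]
    pairs = []
    # For brevity this loop still enumerates all pairs of pivot signatures;
    # a practical version would use a range search over the signatures.
    for i, j in combinations(range(len(points)), 2):
        # Triangle inequality: |d(x,p) - d(y,p)| <= d(x,y), so every pair
        # with d(x,y) <= eps passes; some more distant pairs may pass too.
        if all(abs(a - b) <= eps for a, b in zip(sig[i], sig[j])):
            pairs.append((i, j))
    return pairs
```

The true distance would then be computed only for the returned pairs, which is where the saving over full AHC would come from under these assumptions.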

Highlights

Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications
The HappieClust algorithm is well suited for large-scale gene expression visualization and analysis, both on personal computers and in public online web applications
The software is available from the URL http://www.quretec.com/HappieClust

Summary

Introduction

Various types of biological data resulting from high-throughput experiments require analysis, often consisting of many steps. One possible starting point of interaction is showing an overview of the data to the user, frequently achieved using clustering. We concentrate on hierarchical methods that model the data in a tree structure and leave more freedom to the user. The most well-known hierarchical clustering method is agglomerative hierarchical clustering (AHC). AHC initially treats each data object as a separate cluster. The clustering tree (dendrogram) is built from the leaves towards the root, with each merge of two clusters depicted as a common parent.
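The bottom-up construction described above can be sketched in a few lines. This is a naive illustration of full AHC, not the authors' code. Average linkage and Euclidean distance are assumptions made for the example, and the function name `average_linkage_ahc` is hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage_ahc(points):
    """Naive full AHC; returns the merges as (left_id, right_id, distance)."""
    n = len(points)
    # Every object starts as its own cluster (a leaf of the dendrogram).
    clusters = {i: [i] for i in range(n)}
    # Full AHC needs all n * (n - 1) / 2 pairwise distances up front.
    d = {(i, j): dist(points[i], points[j])
         for i, j in combinations(range(n), 2)}

    def linkage(a, b):
        # Average distance between the members of clusters a and b.
        pairs = [(min(i, j), max(i, j))
                 for i in clusters[a] for j in clusters[b]]
        return sum(d[p] for p in pairs) / len(pairs)

    merges = []
    next_id = n  # ids n, n+1, ... label the internal nodes (common parents)
    while len(clusters) > 1:
        # Greedy step: merge the two currently closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        merges.append((a, b, linkage(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Each returned triple corresponds to one internal node of the dendrogram. The dictionary `d` makes the quadratic cost of full AHC explicit: it holds one entry per pair of objects.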

Methods

Results

Discussion

Conclusion