Accelerating Bayesian Hierarchical Clustering of Time Series Data with a Randomised Algorithm

Robert Darkins, David L Wild, Emma J Cooke, Zoubin Ghahramani, Richard S Savage, Paul D W Kirk, Magnus Rattray

doi:10.1371/journal.pone.0059795

Open Access

https://doi.org/10.1371/journal.pone.0059795

Journal: PLoS ONE
Publication Date: Apr 2, 2013
Citations: 39
License type: CC BY 4.0

Affiliation: University of Warwick, University of Cambridge

Abstract

We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make the analysis of larger genomic data sets practical, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is part of the R package BHC, which can be downloaded from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts that can be used to reproduce the analyses carried out in this paper, at https://sites.google.com/site/randomisedbhc/.

Highlights

Many scientific disciplines are becoming data intensive
We have presented a randomised algorithm for the Bayesian Hierarchical Clustering (BHC) method
The randomised BHC algorithm can be used to obtain a substantial speed-up over the greedy BHC algorithm

Summary

Introduction

Many scientific disciplines are becoming data intensive. These subjects require the development of new and innovative statistical algorithms to fully utilise these data. Time series clustering methods in particular have become popular in many disciplines, with applications such as clustering stocks with different price dynamics in finance [1], clustering regions with different growth patterns [2] or signal clustering [3]. New and increasingly affordable measurement technologies such as microarrays have led to an explosion of high-quality data for transcriptomics, proteomics and metabolomics. These data are generally high-dimensional and are often time courses rather than single time point measurements.

Methods

Results

Conclusion