Bayesian hierarchical clustering for microarray time series data with replicates and outlier measurements

Emma J Cooke, Paul DW Kirk, David L Wild, Robert Darkins, Richard S Savage

doi:10.1186/1471-2105-12-399

Open Access

Journal: BMC Bioinformatics
Publication Date: Oct 13, 2011
Citations: 96
License type: cc-by

Affiliation: University of Warwick

Abstract

Background

Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.

Results

We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and adopts an empirical Bayes approach which uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles.

Conclusions

By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies.
Timeseries BHC is part of the R package 'BHC' (version 1.5), which can be downloaded from Bioconductor (version 2.9 and above) via http://www.bioconductor.org/packages/release/bioc/html/BHC.html?pagewanted=all.
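
As a rough illustration of why a Gaussian process likelihood suits time series with non-uniformly sampled time points, the sketch below scores a single expression profile under a zero-mean Gaussian process. It is not the BHC implementation: the squared exponential covariance, the fixed hyperparameter values and the function names (`se_kernel`, `cholesky`, `gp_log_marginal`) are all assumptions made for the example, and the outlier mixture likelihood and the replicate-informed noise prior are not modelled. Only the Python standard library is used.

```python
import math


def se_kernel(t1, t2, signal_var, length_scale):
    """Squared exponential covariance between two time points (assumed kernel)."""
    return signal_var * math.exp(-0.5 * (t1 - t2) ** 2 / length_scale ** 2)


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def gp_log_marginal(times, values, signal_var=1.0, length_scale=3.0, noise_var=0.1):
    """Log marginal likelihood of one profile under a zero-mean GP.

    The hyperparameters are illustrative defaults, not values from the paper.
    """
    n = len(times)
    # Covariance depends only on time differences, so irregular sampling is fine.
    cov = [
        [
            se_kernel(times[i], times[j], signal_var, length_scale)
            + (noise_var if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    low = cholesky(cov)
    # Forward substitution: solve low @ z = values, so z.z = y^T cov^{-1} y.
    z = []
    for i in range(n):
        z.append((values[i] - sum(low[i][m] * z[m] for m in range(i))) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(n))
    return -0.5 * quad - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)


# Hypothetical profile measured at non-uniformly spaced time points.
times = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
profile = [math.sin(0.5 * t) for t in times]
print(gp_log_marginal(times, profile))
```

Because the covariance is built from the actual time stamps, the same function applies unchanged to evenly or unevenly spaced measurements, and a profile that varies smoothly in time scores higher than one that jumps between adjacent time points.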

Highlights

Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites.
For hierarchical clustering (HCL), SSClust and the method of Zhou et al., the number of clusters was fixed at the number obtained for Bayesian Hierarchical Clustering (BHC)-SE.
We have presented an extension to the BHC algorithm [14] for time-series microarray data, using a likelihood based on Gaussian process regression, which learns the optimum number of clusters given the data, and which incorporates non-uniformly sampled time points.

Summary

Introduction

Post-genomic molecular biology has resulted in an explosion of typically high-dimensional, structured data from new technologies for transcriptomics, proteomics and metabolomics. Often this data measures readouts from large sets of genes, proteins or metabolites over a time course rather than at a single time point. Whilst there are many clustering algorithms available which allow genes to be grouped according to changes in expression level, the standard approaches use pairwise similarity measures, such as correlation or Euclidean distance, to cluster genes on the basis of their expression pattern. These algorithms disregard temporal information: the implicit assumption is that the observations for each gene are independent and identically distributed (iid) and are invariant with respect to the order of the observations. This was demonstrated in the classic paper of Eisen et al. [2], who observed that the biologically meaningful clusters obtained by hierarchical clustering of S. cerevisiae microarray time series data, using a correlation distance metric, disappeared when the observations within each sequence were randomly permuted.
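
The order-invariance of pairwise similarity measures can be checked directly. The sketch below uses hypothetical toy profiles and helper names (`pearson`, `correlation_distances`) that are not from the paper. It applies the same reordering of time points to every gene and compares the correlation distance matrices before and after. This illustrates the invariance only; it does not reproduce the experiment of Eisen et al., in which observations were permuted within each sequence.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distances(profiles):
    """Matrix of 1 - Pearson correlation for every pair of profiles."""
    n = len(profiles)
    return [
        [1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical expression profiles for three genes over eight time points.
times = list(range(8))
profiles = [
    [math.sin(0.8 * t) for t in times],
    [math.sin(0.8 * t) + 0.1 * t for t in times],
    [math.cos(0.8 * t) for t in times],
]

# Apply one shared random reordering of the time points to every gene.
rng = random.Random(0)
order = list(range(len(times)))
rng.shuffle(order)
shuffled = [[p[k] for k in order] for p in profiles]

original_d = correlation_distances(profiles)
shuffled_d = correlation_distances(shuffled)

# Largest absolute change in any pairwise distance caused by the reordering.
max_change = max(
    abs(a - b)
    for row_a, row_b in zip(original_d, shuffled_d)
    for a, b in zip(row_a, row_b)
)
print(max_change)
```

The distances agree up to floating-point rounding, so any clustering built only on this matrix gives the same grouping whether or not the time points are in their true order. A model whose likelihood depends on the time stamps does not have this property.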

Methods

Results

Conclusion