Abstract

Background

Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites. Time series experiments have become increasingly common, necessitating the development of novel analysis tools that capture the resulting data structure. Outlier measurements at one or more time points present a significant challenge, while potentially valuable replicate information is often ignored by existing techniques.

Results

We present a generative model-based Bayesian hierarchical clustering algorithm for microarray time series that employs Gaussian process regression to capture the structure of the data. By using a mixture model likelihood, our method permits a small proportion of the data to be modelled as outlier measurements, and adopts an empirical Bayes approach which uses replicate observations to inform a prior distribution of the noise variance. The method automatically learns the optimum number of clusters and can incorporate non-uniformly sampled time points. Using a wide variety of experimental data sets, we show that our algorithm consistently yields higher quality and more biologically meaningful clusters than current state-of-the-art methodologies. We highlight the importance of modelling outlier values by demonstrating that noisy genes can be grouped with other genes of similar biological function. We demonstrate the importance of including replicate information, which we find enables the discrimination of additional distinct expression profiles.

Conclusions

By incorporating outlier measurements and replicate values, this clustering algorithm for time series microarray data provides a step towards a better treatment of the noise inherent in measurements from high-throughput genomic technologies.
Timeseries BHC is available as part of the R package 'BHC' (version 1.5), which can be downloaded from Bioconductor (version 2.9 and above) via http://www.bioconductor.org/packages/release/bioc/html/BHC.html?pagewanted=all.

Highlights

  • Post-genomic molecular biology has resulted in an explosion of data, providing measurements for large numbers of genes, proteins and metabolites

  • For hierarchical clustering (HCL), SSClust and the method of Zhou et al, the number of clusters was fixed at the number obtained for Bayesian Hierarchical Clustering (BHC)-SE

  • We have presented an extension to the BHC algorithm [14] for time-series microarray data, using a likelihood based on Gaussian process regression, which learns the optimum number of clusters given the data, and which incorporates non-uniformly sampled time points


Summary

Introduction

Post-genomic molecular biology has resulted in an explosion of typically high dimensional, structured data from new technologies for transcriptomics, proteomics and metabolomics. Often this data measures readouts from large sets of genes, proteins or metabolites over a time course rather than at a single time point. Whilst there are many clustering algorithms available which allow genes to be grouped according to changes in expression level, the standard approaches to clustering use pairwise similarity measures, such as correlation or Euclidean distance, to cluster genes on the basis of their expression pattern. These algorithms disregard temporal information: the implicit assumption is that the observations for each gene are independent and identically distributed (iid) and are invariant with respect to the order of the observations. This was demonstrated in the classic paper of Eisen et al [2], who observed that the biologically meaningful clusters obtained by hierarchical clustering of S. cerevisiae microarray time series data, using a correlation distance metric, disappeared when the observations within each sequence were randomly permuted.
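The order-invariance of pairwise similarity measures can be illustrated with a small sketch. It is not taken from the paper or from the BHC package: the two synthetic "gene" profiles, the noise level and the function names (pearson, correlation_distance) are assumptions made purely for illustration. It applies one shared permutation of the time points to both profiles and checks that the correlation distance between them does not change, which is why a method built only on such a distance cannot make use of temporal ordering.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Correlation distance, 1 - r (illustrative helper, not the BHC API)."""
    return 1.0 - pearson(x, y)


# Synthetic, hypothetical expression profiles over 12 time points.
rng = random.Random(0)
times = list(range(12))
gene_a = [math.sin(0.5 * t) + rng.gauss(0, 0.1) for t in times]
gene_b = [math.sin(0.5 * t + 0.2) + rng.gauss(0, 0.1) for t in times]

# Apply the same random reordering of time points to both profiles.
order = times[:]
rng.shuffle(order)
shuffled_a = [gene_a[i] for i in order]
shuffled_b = [gene_b[i] for i in order]

d_original = correlation_distance(gene_a, gene_b)
d_shuffled = correlation_distance(shuffled_a, shuffled_b)

# The two distances agree up to floating-point rounding.
print(d_original, d_shuffled)
```

A model that treats each profile as a function of time, such as the Gaussian process regression described in the abstract, does depend on the order and spacing of the time points, so it can exploit structure that this distance ignores.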
