Abstract

We present a novel nonparametric Bayesian approach for performing cluster analysis in a context where observational units have data arising from multiple sources. Our approach uses a particle Gibbs sampler for inference, in which cluster allocations are jointly updated using a conditional particle filter within a Gibbs sampler, improving the mixing of the MCMC chain. We develop several approaches to improving the computational performance of our algorithm. These methods can achieve more than an order-of-magnitude improvement in performance at no cost to accuracy, and can be applied more broadly to Bayesian inference for mixture models with a single dataset. We apply our algorithm to the discovery of risk cohorts amongst 243 patients presenting with kidney renal clear cell carcinoma, using samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, DNA methylation, protein expression and microRNA data. We identify four distinct consensus subtypes and show they are prognostic for survival rate (p < 0.0001).

Highlights

  • We propose a novel integrative clustering algorithm built within the framework of multiple dataset integration (MDI) (Kirk et al., 2012), a flexible model-based integrative clustering algorithm which facilitates the sharing of information between datasets

Summary

Introduction

Cluster analysis can broadly be described as the task of inferring an underlying group structure in a dataset. In analysing genomic data we may aim to infer risk cohorts among patients suffering from particular diseases given their genetic make-up, or we may look to infer groups of genes to help gain an understanding of their function. These application areas pose issues not typically encountered in other contexts, owing in particular to the fact that each unit of observation (i.e. patients in the former example) may have data arising from multiple data sources, e.g. gene expression, DNA methylation, or copy number variations. These data sources each give complementary, but differing, views of the underlying processes, and it is vital that analyses of such data can encompass these data sources in a single, integrative analysis.

