Abstract

Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each Gaussian component. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on the cancer datasets show that, in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than those of several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
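To illustrate why a normal-gamma conjugate prior is convenient for this kind of model, the sketch below computes the closed-form log marginal likelihood of one-dimensional Gaussian data with unknown mean and precision. It is a minimal illustration, not the GBHC implementation: the function names, the default hyperparameters (mu0, kappa0, alpha0, beta0) and the toy data are assumptions made here, and the merge score is a plain log Bayes factor that omits the Dirichlet process weighting used by BHC.

```python
import math


def log_marginal_likelihood(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log evidence of 1-D data under a Gaussian likelihood with a
    normal-gamma prior on (mean, precision).

    Hyperparameter names and defaults are illustrative assumptions.
    """
    n = len(x)
    if n == 0:
        return 0.0  # no data: the evidence is 1
    mean = sum(x) / n
    ss = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations

    # Standard conjugate posterior updates for the normal-gamma prior.
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)

    return (
        math.lgamma(alpha_n) - math.lgamma(alpha0)
        + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
        + 0.5 * (math.log(kappa0) - math.log(kappa_n))
        - 0.5 * n * math.log(2.0 * math.pi)
    )


def log_merge_evidence(a, b, **prior):
    """Log Bayes factor for 'a and b share one Gaussian' versus
    'a and b come from two separate Gaussians' (simplified score)."""
    return (
        log_marginal_likelihood(a + b, **prior)
        - log_marginal_likelihood(a, **prior)
        - log_marginal_likelihood(b, **prior)
    )


# Toy data (hypothetical): two groups near 0 and one group near 5.
near_zero_a = [0.1, -0.2, 0.0]
near_zero_b = [0.05, -0.1]
near_five = [5.0, 5.2, 4.9]

similar = log_merge_evidence(near_zero_a, near_zero_b)
different = log_merge_evidence(near_zero_a, near_five)
print(f"merge similar groups:   {similar:.3f}")
print(f"merge different groups: {different:.3f}")
```

A positive score favours merging and a negative score favours keeping the groups apart, which is the kind of model comparison a Bayesian hierarchical clustering procedure performs at each candidate merge.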

Highlights

  • Clustering analysis is an important tool in studying genomic data such as gene expression profiles and can be used to infer biological function and regulation of genes

  • We presented a model-based clustering algorithm which employs a Gaussian mixture model to model the gene expression profiles in a Bayesian framework

  • We proposed two variations of the Gaussian BHC (GBHC) algorithm: GBHC-TREE and GBHC-NODE, according to two different hyperpara


Summary

Introduction

Clustering analysis is an important tool in studying genomic data such as gene expression profiles and can be used to infer the biological function and regulation of genes. In modern medical research, clustering analysis has been used to identify disease subtypes based on genetic variation [5,6], and to identify gene expression signatures that can be used as prognostic markers for known disease subtypes [7,8,9]. This aids the stratification of patients for personalized medicine. Two common choices of metric in the gene clustering literature are Euclidean distance and the Pearson correlation coefficient [15]. The problems of identifying the number of clusters and choosing the distance metric can be cast as a model selection problem: how to choose a statistical model that best describes the data.
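The two metrics named above can behave very differently on the same expression profiles. The sketch below is a minimal illustration with made-up profiles: the function names are assumptions, and "Pearson distance" is taken here as one minus the correlation coefficient, which is a common convention but not necessarily the one used in the cited work.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """One minus the Pearson correlation coefficient (assumed convention).

    0 means perfectly correlated, 2 means perfectly anti-correlated.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


# Hypothetical profiles: same shape at a different scale, and a reversed shape.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # gene_a scaled by 10
gene_c = [4.0, 3.0, 2.0, 1.0]      # gene_a reversed

print(euclidean_distance(gene_a, gene_b), pearson_distance(gene_a, gene_b))
print(euclidean_distance(gene_a, gene_c), pearson_distance(gene_a, gene_c))
```

Here gene_a and gene_b are far apart in Euclidean terms yet essentially identical under the correlation-based distance, while gene_a and gene_c are close in Euclidean terms but maximally distant under correlation. This is why the choice of metric can change which genes end up in the same cluster.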
