Abstract

Background: Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of approaches have been proposed for this purpose, among which mixture models provide a principled probabilistic framework. Cluster analysis is increasingly supplemented with multiple data sources, and these heterogeneous information sources should be used as efficiently as possible.

Results: This paper presents a novel Beta-Gaussian mixture model (BGMM) for clustering genes based on Gaussian distributed and beta distributed data. The proposed BGMM can be viewed as a natural extension of the beta mixture model (BMM) and the Gaussian mixture model (GMM). It differs from other mixture-model-based methods in that it integrates two different data types into a single, unified probabilistic modeling framework, which makes more efficient use of multiple data sources than methods that analyze the data sources separately. Moreover, BGMM provides an exceedingly flexible modeling framework, since many data sources can be modeled as Gaussian or beta distributed random variables, and it can be extended to integrate data that follow other parametric distributions, which adds further flexibility to this model-based clustering framework. We developed three types of estimation algorithms for BGMM: the standard expectation maximization (EM) algorithm, an approximated EM, and a hybrid EM. We propose to tackle the model selection problem with well-known model selection criteria, of which we test the Akaike information criterion (AIC), a modified AIC (AIC3), the Bayesian information criterion (BIC), and the integrated classification likelihood-BIC (ICL-BIC).

Conclusion: Performance tests with simulated data show that combining two different data sources into a single joint mixture model greatly improves the clustering accuracy compared with either of its two extreme cases, GMM or BMM. Applications with real mouse gene expression data (modeled as Gaussian distributed) and protein-DNA binding probabilities (modeled as beta distributed) also demonstrate that BGMM can yield more biologically reasonable results than either of its two extreme cases. One of our applications found three groups of genes that are likely to be involved in Myd88-dependent Toll-like receptor 3/4 (TLR-3/4) signaling cascades, which might help to better understand TLR-3/4 signal transduction.
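The four model selection criteria named in the abstract can be illustrated with a minimal sketch. It uses the common textbook definitions (penalized negative log-likelihood, lower is better), which may differ in sign or scaling from the definitions used in the full text. The function name `information_criteria` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def information_criteria(log_lik, n_params, n_obs, resp=None):
    """Return AIC, AIC3, BIC and, if responsibilities are given, ICL-BIC.

    Assumed textbook forms (lower is better):
      AIC     = -2 logL + 2 d
      AIC3    = -2 logL + 3 d
      BIC     = -2 logL + d log(n)
      ICL-BIC = BIC + 2 * entropy of the soft cluster assignments
    where d is the number of free parameters and n the number of genes.
    """
    aic = -2.0 * log_lik + 2.0 * n_params
    aic3 = -2.0 * log_lik + 3.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    scores = {"AIC": aic, "AIC3": aic3, "BIC": bic}
    if resp is not None:
        # resp[i][k] is the posterior probability that gene i is in cluster k.
        # Terms with probability 0 contribute nothing to the entropy.
        entropy = -sum(r * math.log(r) for row in resp for r in row if r > 0.0)
        scores["ICL-BIC"] = bic + 2.0 * entropy
    return scores
```

In use, one would fit the mixture for several candidate numbers of clusters and keep the candidate with the lowest score under the chosen criterion. ICL-BIC additionally penalizes fuzzy assignments, so it tends to favor well-separated clusters.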

Highlights

  • Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis

  • We first compared the performance of the Beta-Gaussian mixture model (BGMM) under different expectation maximization (EM) algorithms on artificial data, and on that basis chose one EM algorithm for the later simulations

  • Performance test of BGMM with artificial data: to evaluate the overall performance of a clustering method, we developed a scoring system that measures clustering accuracy on artificial data

Summary

Introduction

Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. In the field of gene clustering, gene expression data have been widely used under the assumption that genes with similar expression patterns have similar cellular functions and are likely to be involved in the same cellular processes [1]. This assumption might be too simplistic given the complexity of real biological systems. We developed a clustering algorithm that can cluster genes based on beta distributed and Gaussian distributed data. In a real case study, these are represented by protein-DNA binding probabilities (predictions from a software tool [2]) and gene expression data, respectively. Other data sources that can be naturally modeled with beta distributions include, for example, correlations [3] and pair-wise and multiple sequence similarities [4]; other possible Gaussian distributed data sources include various other microarray-based measurements.
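To make the idea of clustering on both data types concrete, the following is a minimal sketch of how posterior cluster probabilities for a single gene could be computed under a beta-Gaussian mixture. It assumes that, within a cluster, all beta and Gaussian features are independent; this excerpt does not state the exact parameterization used in the full text, so that choice is an assumption. All names (`beta_logpdf`, `gauss_logpdf`, `responsibilities`, and the dictionary keys) are hypothetical.

```python
import math


def beta_logpdf(x, a, b):
    # Log density of Beta(a, b) at x, valid for 0 < x < 1.
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log1p(-x)


def gauss_logpdf(y, mu, sigma):
    # Log density of a univariate Gaussian with mean mu and std sigma.
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def responsibilities(x, y, components):
    """Posterior cluster probabilities for one gene (an E-step for one row).

    x: beta-type features in (0, 1), e.g. binding probabilities.
    y: Gaussian-type features, e.g. expression values.
    components: list of dicts with keys "weight", "a", "b", "mu", "sigma",
                where "a", "b" match x and "mu", "sigma" match y in length.
    """
    log_joint = []
    for c in components:
        lp = math.log(c["weight"])
        lp += sum(beta_logpdf(xi, a, b) for xi, a, b in zip(x, c["a"], c["b"]))
        lp += sum(gauss_logpdf(yi, m, s) for yi, m, s in zip(y, c["mu"], c["sigma"]))
        log_joint.append(lp)
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_joint)
    unnorm = [math.exp(lp - top) for lp in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

A gene with a high binding probability and high expression would be assigned almost entirely to a component whose beta part concentrates near 1 and whose Gaussian mean is high. Dropping either `x` or `y` (passing empty lists) reduces the sketch to a pure GMM or a pure BMM, the two extreme cases mentioned in the abstract.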
