Bayesian mixture modeling using a hybrid sampler with application to protein subfamily identification

Youyi Fong ,Jon Wakefield ,Kenneth Rice

doi:10.1093/biostatistics/kxp033

Abstract

Predicting protein function is essential to advancing our knowledge of biological processes. This article is focused on discovering the functional diversification within a protein family. A Bayesian mixture approach is proposed to model a protein family as a mixture of profile hidden Markov models. For a given mixture size, a hybrid Markov chain Monte Carlo sampler comprising both Gibbs sampling steps and hierarchical clustering-based split/merge proposals is used to obtain posterior inference. Inference for mixture size concentrates on comparing the integrated likelihoods. The choice of priors is critical with respect to the performance of the procedure. Through simulation studies, we show that 2 priors that are based on independent data sets allow correct identification of the mixture size, both when the data are homogeneous and when the data are generated from a mixture. We illustrate our method using 2 sets of real protein sequences.

Full Text