Abstract

Bayesian nonparametric (BNP) models are theoretically well suited to streaming data because their complexity grows with the amount of data observed over time. A rich body of literature has developed efficient approximate methods for posterior inference in BNP models, typically dominated by MCMC. However, very little work has addressed posterior inference in a streaming fashion, which is essential to fully realize the potential of BNP models in real-world tasks. The main challenge lies in developing a one-pass posterior update that remains consistent with the data streamed over time (i.e., each data point is scanned only once), a requirement that general MCMC methods fail to meet. Dirichlet process-based mixture models, meanwhile, are the most fundamental building blocks in the field of BNP. To this end, we develop in this paper a class of variational methods for posterior inference in Dirichlet process mixture (DPM) models in which both the posterior update and the data are presented in a streaming setting. We first propose new methods that advance existing variational inference approaches for BNP by allowing the variational distributions to grow over time, thereby overcoming an important limitation of current methods: their parametric, truncated restrictions on the variational distributions. This yields two new methods, truncation-free variational Bayes (TFVB) and truncation-free maximization-expectation (TFME), the latter of which further supports hard clustering. These inference methods form the foundation of our streaming inference algorithm, in which we adapt the recent Streaming Variational Bayes framework of Broderick et al. (2013) to our task. To demonstrate our framework on real-world tasks, whose datasets are often heterogeneous, we also develop a theoretical extension of our model to handle assorted data, where each observation consists of different data types.
Our experiments on automatically learning the number of clusters demonstrate that the inference capability of our framework is comparable to that of truncated variational inference algorithms on both synthetic and real-world datasets. Moreover, an evaluation of the streaming learning algorithms on text corpora reveals both the quantitative and qualitative efficacy of the algorithms for clustering documents.
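To make concrete the BNP property the abstract relies on — that the number of clusters is unbounded and grows with the data, so no truncation level need be fixed in advance — the following is a minimal illustration of the Dirichlet process prior over partitions via the Chinese restaurant process. This is a standard textbook sketch, not the paper's TFVB/TFME algorithm; the function name and parameters are our own choices.

```python
import random

def crp_assignments(n, alpha, seed=0):
    """Sample cluster assignments from a Chinese restaurant process (CRP),
    the partition distribution induced by a Dirichlet process with
    concentration parameter alpha.

    Point i joins existing cluster k with probability counts[k] / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha), so the
    number of clusters grows with n (on the order of alpha * log n) rather
    than being truncated in advance.
    """
    rng = random.Random(seed)
    counts = []   # counts[k] = number of points assigned to cluster k
    labels = []   # labels[i] = cluster index of point i
    for i in range(n):
        # Draw a point uniformly in [0, i + alpha): the first i units of
        # mass are split among existing clusters, the final alpha units
        # correspond to opening a new cluster.
        r = rng.uniform(0, i + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                labels.append(k)
                break
        else:
            counts.append(1)
            labels.append(len(counts) - 1)
    return labels, counts
```

Running this with growing `n` shows the cluster count increasing without any fixed ceiling, which is the behavior a truncated variational approximation sacrifices and the truncation-free methods above aim to preserve.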
