Stochastic Blockmodel with Cluster Overlap, Relevance Selection, and Similarity-Based Smoothing

Joyce Jiyoung Whang,Piyush Rai,Inderjit S Dhillon

doi:10.1109/icdm.2013.156

Abstract

Stochastic block models provide a rich, probabilistic framework for modeling relational data by expressing the objects being modeled in terms of a latent vector representation. This representation can be a latent indicator vector denoting the cluster membership (hard clustering), a vector of cluster membership probabilities (soft clustering), or more generally a real-valued vector (latent space representation). Recently, a new class of overlapping stochastic block models has been proposed where the idea is to allow the objects to have hard memberships in multiple clusters (in form of a latent binary vector). This aspect captures the properties of many real-world networks in domains such as biology and social networks where objects can simultaneously have memberships in multiple clusters owing to the multiple roles they may have. In this paper, we improve upon this model in three key ways: (1) we extend the overlapping stochastic block model to the bipartite graph case which enables us to simultaneously learn the overlapping clustering of two different sets of objects in the graph, the unipartite graph is just a special case of our model, (2) we allow objects (in either set) to not have membership in any cluster by using a relevant object selection mechanism, and (3) we make use of additionally available object features (or a kernel matrix of pair wise object similarities) to further improve the overlapping clustering performance. We do this by explicitly encouraging similar objects to have similar cluster membership vectors. Moreover, using nonparametric Bayesian prior distributions on the key model parameters, we side-step the model selection issues such as selecting the number of clusters a priori. Our model is quite general and can be applied for both overlapping clustering and link prediction tasks in unipartite and bipartite networks (directed/undirected), or for overlapping co-clustering of general binary-valued data. Experiments on synthetic and real-world datasets from biology and social networks demonstrate that our model outperforms several state-of-the-art methods.

Full Text