Abstract

Online clustering of short text streams has become increasingly important due to the popularity of news and social media platforms. The objective of online clustering is to maintain active topics (clusters) by automatically detecting new topics and forgetting outdated ones. Most existing approaches exploit static, high-dimensional semantic term representations of the text to enhance clustering quality. These approaches also use inference procedures that depend on a fixed batch size to reduce the number of clusters related to a given topic and bring it closer to the actual number of topics. This paper proposes a non-parametric Dirichlet model with episodic inference (EINDM) to cluster evolving short text streams by introducing a window-based, low-dimensional semantic term representation that captures the contextual relationships between words. In addition, an episodic inference procedure is introduced to reduce cluster sparsity in the model. Furthermore, a novel “word specificity” measure, based on neighborhood terms, is proposed to capture the evolving contexts of individual terms. Extensive empirical evaluation demonstrates that EINDM outperforms recent state-of-the-art clustering models in terms of normalized mutual information (NMI), homogeneity, and cluster purity.
