Abstract

Online clustering of short text streams has become significant due to the popularity of news and social media platforms. The objective of online clustering is to maintain active topics (clusters) by automatically detecting new topics and forgetting outdated ones. Most existing approaches exploit a static, high-dimensional semantic term representation of the text to enhance clustering quality. These approaches also rely on inference procedures that depend on a fixed batch size to reduce the number of clusters related to a given topic and bring it closer to the actual number of topics. This paper proposes a non-parametric Dirichlet model with episodic inference (EINDM) to cluster evolving short text streams by introducing a window-based, low-dimensional semantic term representation that captures the contextual relationships between words. In addition, an episodic inference procedure is introduced to reduce cluster sparsity in the model. Furthermore, a novel “word specificity” measure, based on neighborhood terms, is proposed to account for the evolving contexts of individual terms. Extensive empirical evaluation demonstrates that EINDM yields the best performance in terms of NMI, homogeneity, and cluster purity compared to recent state-of-the-art clustering models.
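For readers unfamiliar with the three evaluation measures named above, the following is a minimal sketch of how they are commonly computed from ground-truth topic labels and predicted cluster labels. It is not taken from the paper: the function names are illustrative, and the arithmetic-mean normalisation used for NMI is an assumption, since the abstract does not say which variant is used.

```python
from collections import Counter
from math import log


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority topic of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_true, labels_pred):
    """Mutual information between topic labels and cluster labels."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    return sum(
        (c / n) * log(c * n / (count_t[t] * count_p[p]))
        for (t, p), c in joint.items()
    )


def nmi(labels_true, labels_pred):
    """NMI normalised by the arithmetic mean of the entropies (an assumed variant)."""
    h_t, h_p = entropy(labels_true), entropy(labels_pred)
    if h_t == 0 and h_p == 0:
        return 1.0
    return 2 * mutual_information(labels_true, labels_pred) / (h_t + h_p)


def homogeneity(labels_true, labels_pred):
    """1 - H(topic | cluster) / H(topic), which equals I(topic; cluster) / H(topic)."""
    h_t = entropy(labels_true)
    if h_t == 0:
        return 1.0
    return mutual_information(labels_true, labels_pred) / h_t


# Toy data (hypothetical): four documents, two topics.
truth = ["a", "a", "b", "b"]
exact = [0, 0, 1, 1]        # one cluster per topic
oversplit = [0, 1, 2, 3]    # every document in its own cluster
```

On the toy data, the over-split assignment still reaches perfect purity and homogeneity, but its NMI drops. This is one reason to report NMI alongside the other two when a model may produce many sparse clusters per topic.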
