Abstract

Clustering short text streams is a challenging task because of their unique properties: infinite length, sparse data representation and cluster evolution. Existing approaches often process short text streams in batches. However, determining the optimal batch size is usually difficult, since we have no prior knowledge of when the topics evolve. In addition, the traditional independent word representation in graphical models tends to cause a “term ambiguity” problem in short text clustering. Therefore, in this paper, we propose an Online Semantic-enhanced Dirichlet Model for short text stream clustering, called OSDM, which integrates word-occurrence semantic information (i.e., context) into a new graphical model and clusters each arriving short text automatically in an online way. Extensive experiments demonstrate that OSDM performs better than many state-of-the-art algorithms on both synthetic and real-world data sets.

Highlights

  • A massive amount of short text data is constantly generated on online social platforms such as microblogs, Twitter and Facebook

  • Traditional clustering algorithms for static data were enhanced and transformed for text streams (Zhong, 2005). They have since been replaced by model-based algorithms such as Latent Dirichlet Allocation (LDA) (Blei et al, 2003), dynamic topic model (DTM) (Blei and Lafferty, 2006), TDPM (Ahmed and Xing, 2008), GSDMM (Yin and Wang, 2016b), DPMFP (Huang et al, 2013), TM-LDA (Wang et al, 2012), NPMM (Chen et al, 2019) and MStream (Yin et al, 2018), to name a few

  • Because samples are drawn from the distribution repeatedly, the same color may appear more than once. This means that we have K distinct colors across n draws. This behavior is described by a well-known process called the Chinese restaurant process (CRP) (Ferguson, 1973)
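
The repeated-draw behavior in the last highlight can be illustrated with a minimal simulation of the Chinese restaurant process. This is a generic sketch, not the OSDM implementation: the function name `chinese_restaurant_process`, the concentration parameter `alpha`, and the `seed` argument are assumptions introduced here for illustration. At draw i (counting from 0), an existing color k is chosen with probability counts[k] / (i + alpha), and a new color with probability alpha / (i + alpha).

```python
import random


def chinese_restaurant_process(n_draws, alpha, seed=0):
    """Simulate n_draws from a CRP with concentration alpha.

    Returns (labels, counts): labels[i] is the color index of draw i,
    and counts[k] is how many draws received color k.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of earlier draws with color k
    labels = []
    for i in range(n_draws):
        # Total weight is i (all earlier draws) plus alpha (a new color).
        r = rng.uniform(0, i + alpha)
        chosen = len(counts)  # default: open a new color
        cumulative = 0.0
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                chosen = k  # reuse an existing color, proportional to its count
                break
        if chosen == len(counts):
            counts.append(1)
        else:
            counts[chosen] += 1
        labels.append(chosen)
    return labels, counts


labels, counts = chinese_restaurant_process(n_draws=100, alpha=1.0)
K = len(counts)  # number of distinct colors observed across the n draws
```

Here `K` is not fixed in advance: it grows as new colors are opened, and larger values of `alpha` tend to produce more distinct colors for the same number of draws.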

Summary

Introduction

A massive amount of short text data is constantly generated on online social platforms such as microblogs, Twitter and Facebook. Clustering such short text streams has gained increasing attention in recent years because of its many real-world applications, such as event tracking, hot topic detection, and news recommendation (Hadifar et al, 2019). Most established approaches work in a batch way and assume that the instances within a batch are interchangeable. This assumption usually does not hold for a topic-evolving text corpus. Tweets of these two topics share few common terms, i.e., ‘health’ or ‘apple’. Representing co-occurring terms (i.e., context) helps a model identify the topic correctly. To solve these issues, we propose an online semantic-enhanced Dirichlet model for short text stream clustering. (1) The online model does not require an optimal batch size to be determined, and lends itself to handling large-scale data streams efficiently; (2) to the best of our knowledge, it is the first work to integrate semantic information into model-based online clustering, which handles the “term ambiguity” problem effectively and supports high-quality clustering; (3) equipped with the Polya urn scheme, the number of clusters (topics) is determined automatically in our cluster model.
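
To make the role of context concrete, the sketch below counts term co-occurrences in two short texts that share the ambiguous terms ‘apple’ and ‘health’. It is only an illustration of the general idea, not the representation OSDM actually uses: the two example tweets and the helper `cooccurrence_counts` are invented here, and whitespace tokenization is a simplifying assumption.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(tokens):
    """Count unordered pairs of distinct terms appearing in one short text.

    Hypothetical helper for illustration; pairs are stored in sorted order.
    """
    terms = sorted(set(tokens))
    return Counter(combinations(terms, 2))


# Invented example tweets: both mention 'apple' and 'health',
# but one is about fruit and the other about a technology product.
tweet_fruit = "apple a day keeps health problems away".split()
tweet_phone = "apple launches health app on new phone".split()

pairs_fruit = cooccurrence_counts(tweet_fruit)
pairs_phone = cooccurrence_counts(tweet_phone)

# Overlap measured on single terms versus on co-occurring pairs.
shared_terms = set(tweet_fruit) & set(tweet_phone)
shared_pairs = set(pairs_fruit) & set(pairs_phone)
```

Looking only at single terms, the two tweets overlap on both ambiguous words. Looking at co-occurring pairs, almost all of each tweet's pairs (such as ‘apple’ with ‘day’ versus ‘apple’ with ‘phone’) are unique to it, which is the kind of signal a context-aware model can exploit.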

Related Work
Problem Formulation
Dirichlet Process
Model Representation
Model Formulation
OSDM Algorithm
Datasets and evaluation metrics
Baselines
Comparison with state-of-the-art methods
Sensitivity Analysis
Runtime
Conclusion