An Online Semantic-enhanced Dirichlet Model for Short Text Stream Clustering

Jay Kumar, Wazir Ali, Salah Uddin, Junming Shao

doi:10.18653/v1/2020.acl-main.70

Abstract

Clustering short text streams is a challenging task due to their unique properties: infinite length, sparse data representation and cluster evolution. Existing approaches often process short text streams in batches. However, determining the optimal batch size is usually difficult, since we have no prior knowledge of when the topics evolve. In addition, the traditional independent word representation in graphical models tends to cause a “term ambiguity” problem in short text clustering. Therefore, in this paper, we propose an Online Semantic-enhanced Dirichlet Model for short text stream clustering, called OSDM, which integrates word-occurrence semantic information (i.e., context) into a new graphical model and clusters each arriving short text automatically in an online way. Extensive results demonstrate that OSDM performs better than many state-of-the-art algorithms on both synthetic and real-world data sets.

Highlights

A massive amount of short text data is constantly generated on online social platforms such as microblogs, Twitter and Facebook.
Traditional clustering algorithms for static data were enhanced and transformed for text streams (Zhong, 2005). They were then replaced by model-based algorithms such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), dynamic topic model (DTM) (Blei and Lafferty, 2006), TDPM (Ahmed and Xing, 2008), GSDMM (Yin and Wang, 2016b), DPMFP (Huang et al., 2013), TM-LDA (Wang et al., 2012), NPMM (Chen et al., 2019) and MStream (Yin et al., 2018), to mention a few.
Because samples are drawn from the distribution repeatedly, the same color may appear more than once. This means that we have K distinct colors across n draws. This condition is described by a well-known process called the Chinese restaurant process (CRP) (Ferguson, 1973).
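The repeated-draw behaviour described in the last highlight can be illustrated with a small simulation. This is a minimal sketch of the standard Chinese restaurant process, not the OSDM model itself; the function name, the concentration parameter `alpha` and the `seed` argument are assumptions introduced for illustration.

```python
import random

def chinese_restaurant_process(n_draws, alpha, seed=0):
    """Simulate n_draws from a CRP with concentration alpha (alpha > 0).

    Returns (assignments, counts): the color index of each draw and
    the number of draws that landed on each distinct color.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = how many draws so far have color k
    assignments = []   # color index of each draw, in arrival order
    for _ in range(n_draws):
        # An existing color k is chosen with weight counts[k];
        # a brand-new color is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # first appearance of a new color
        else:
            counts[k] += 1     # an already seen color appears again
        assignments.append(k)
    return assignments, counts

assignments, counts = chinese_restaurant_process(n_draws=100, alpha=1.0)
print("n =", len(assignments), "draws, K =", len(counts), "distinct colors")
```

Here K is not fixed in advance: it grows as new colors appear, and larger values of `alpha` tend to produce more distinct colors for the same n.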

Summary

Introduction

A massive amount of short text data is constantly generated on online social platforms such as microblogs, Twitter and Facebook. Clustering of such short text streams has gained increasing attention in recent years due to many real-world applications like event tracking, hot topic detection, and news recommendation (Hadifar et al., 2019). Most established approaches work in a batch way and assume that the instances within a batch are interchangeable. This assumption usually cannot hold for a topic-evolving text corpus. Tweets of these two topics share few common terms, i.e., ‘health’ or ‘apple’. The co-occurring terms representation (i.e., context) helps a model to identify the topic correctly. To solve these issues, we propose an online semantic-enhanced Dirichlet model for short text stream clustering. (1) The online model does not need to determine an optimal batch size, and lends itself to handling large-scale data streams efficiently; (2) To the best of our knowledge, it is the first work to integrate semantic information for model-based online clustering, which is able to handle the “term ambiguity” problem effectively and support high-quality clustering; (3) Equipped with the Polya urn scheme, the number of clusters (topics) is determined automatically in our cluster model.
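The role of co-occurring terms can be made concrete with a toy comparison. This is a minimal sketch under stated assumptions, not the probability formulation used by OSDM: the three token lists are invented for illustration, and `cooccurrence_counts` and `pair_overlap` are hypothetical helper names.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(tokens):
    """Count unordered pairs of distinct terms appearing in one short text."""
    terms = sorted(set(tokens))          # sort so each pair has one canonical order
    return Counter(combinations(terms, 2))

def pair_overlap(a, b):
    """Number of distinct co-occurring term pairs shared by two short texts."""
    return len(cooccurrence_counts(a).keys() & cooccurrence_counts(b).keys())

# Invented toy tweets: two about the fruit, one about the company.
fruit_1 = ["apple", "fruit", "health"]
fruit_2 = ["apple", "health", "diet"]
tech = ["apple", "iphone", "release"]

# Single terms alone cannot separate the topics: 'apple' occurs in all three.
print(len(set(fruit_1) & set(tech)))     # shared single terms
# Term pairs can: ('apple', 'health') links only the two fruit tweets.
print(pair_overlap(fruit_1, fruit_2))    # shared term pairs, same topic
print(pair_overlap(fruit_1, tech))       # shared term pairs, different topics
```

Under these toy inputs the ambiguous term matches across topics, while the term pairs match only within the same topic, which is the intuition behind using context to reduce term ambiguity.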

Related Work

Problem Formulation

Dirichlet Process

Model Representation

Model Formulation

OSDM Algorithm

Datasets and evaluation metrics

Baselines

Comparison with state-of-the-art methods

Sensitivity Analysis

Runtime

Conclusion

Publication Date: Jan 1, 2020
Citations: 54
License type: cc-by

Similar Papers

A Drift-Sensitive Distributed LSTM Method for Short Text Stream Classification
Peipei Li ... Kui Yu
IEEE Transactions on Big Data | VOL. 9
01 Feb 2023

An Online Dirichlet Model based on Sentence Embedding and DBSCAN for Noisy Short Text Stream Clustering
Xianliang Si ... Yuhong Zhang
18 Jul 2022

A Dirichlet process biterm-based mixture model for short text stream clustering
Junyang Chen ... Weiwen Liu
Applied intelligence (Dordrecht, Netherlands) | VOL. 50
01 Feb 2020

An Online Semantic-Enhanced Graphical Model for Evolving Short Text Stream Clustering
Jay Kumar ... Salah Ud Din
IEEE transactions on cybernetics | VOL. 52
01 Dec 2022
