Abstract

In recent years, short texts have become a prevalent kind of text on the internet. Because each text is short, conventional topic models suffer from the sparsity of word co-occurrence information when applied to short texts. Researchers have proposed various customized topic models for short texts that provide additional word co-occurrence information. However, these models cannot incorporate sufficient semantic word co-occurrence information and may introduce additional noisy information. To address these issues, we propose a self-aggregated topic model that incorporates document embeddings. Aggregating short texts into long documents according to document embeddings provides sufficient word co-occurrence information and avoids incorporating non-semantic word co-occurrence information. However, document embeddings of short texts contain a lot of noisy information resulting from the sparsity of word co-occurrence information, so we discard the noisy information by transforming the document embeddings into global and local semantic information. The global semantic information is a similarity probability distribution over the entire dataset, and the local semantic information is the distances of similar short texts. We then adopt a nested Chinese restaurant process to incorporate these two kinds of information. Finally, we compare our model to several state-of-the-art models on four real-world short-text corpora. The experimental results show that our model achieves better performance in terms of topic coherence and classification accuracy.
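
To make the nested Chinese restaurant process concrete, here is a minimal sketch of the standard (single-level) Chinese restaurant process it builds on: the first customer sits at the first table, and each later customer either joins an occupied table with probability proportional to its size or opens a new table with probability proportional to a dispersion prior. The function name and the dispersion value below are illustrative, not taken from the paper.

    import random

    def crp_assignments(num_customers, alpha):
        """Sample table assignments from a Chinese restaurant process.

        The first customer sits at the first table; customer n+1 joins
        table k with probability n_k / (n + alpha) and opens a new table
        with probability alpha / (n + alpha), where n_k is the number of
        customers already seated at table k.
        """
        table_sizes = []            # n_k for each occupied table
        assignments = []
        for n in range(num_customers):
            if n == 0:
                table_sizes.append(1)   # first customer, first table
                assignments.append(0)
                continue
            # Unnormalized weights: each occupied table, plus one new table.
            weights = table_sizes + [alpha]
            k = random.choices(range(len(weights)), weights=weights)[0]
            if k == len(table_sizes):
                table_sizes.append(1)   # open a new table
            else:
                table_sizes[k] += 1     # join an existing table
            assignments.append(k)
        return assignments

    print(crp_assignments(10, alpha=1.0))

In a self-aggregated model such as DESTM, the tables play the role of long documents, so the dispersion priors α and β control how readily a short text starts a new long document at each of the two nested steps.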

Highlights

  • With the growth of social media and mobile phone applications, short texts have become a prevalent and important kind of information on the internet

  • The joint probability distribution of our model is p(l∗, l, z, w, α, β, γ, δ), where l∗ is the long-document variable generated in the first step of the nested Chinese restaurant process, l is the long-document variable generated in the second step, z is the topic variable, w is the word variable, α is the dispersion prior of the short texts that sample l∗, β is the dispersion prior of the short texts that sample l, γ is the prior of the multinomial distribution between z and l, and δ is the prior of the multinomial distribution between z and w (a plausible factorization is sketched after this list)

  • We set the parameters of DESTM to α = 0 and η = −1. With these settings, the model aggregates short texts according to the complete document embeddings and therefore includes all of the noisy information
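
Assuming the generative steps listed above and writing the joint distribution conditionally on the hyperparameters, it plausibly factorizes as

    p(l∗, l, z, w | α, β, γ, δ) = p(l∗ | α) · p(l | l∗, β) · p(z | l, γ) · p(w | z, δ),

where the first two factors are the two steps of the nested Chinese restaurant process and the last two are the multinomial draws of topics and words. This factorization is reconstructed from the variable definitions above, not quoted from the paper.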

Summary

Introduction

With the growth of social media and mobile phone applications, short texts have become a prevalent and important form of information on the internet. For documents of regular size, conventional topic models such as LDA [2] and HDP [3] perform well; these methods automatically generate topics according to word co-occurrence information. Other methods incorporate word embeddings generated from an auxiliary corpus of regular-length documents [10,11,12,13,14]. Aggregating short texts into long documents provides sufficient word co-occurrence information and makes local word co-occurrence information no longer sparse; these self-aggregation models seem more reasonable than other strategies, since no auxiliary information is needed. The long documents generated by our model effectively avoid incorporating non-semantic word co-occurrence information, because document embedding information provides the similarities of short texts.
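
As an illustration of how document embeddings can supply these similarities, the sketch below derives the two signals described in the abstract: a global similarity probability distribution over the entire dataset and local distances to each short text's most similar neighbors. The embeddings are assumed to be precomputed, and all names and the neighborhood size k are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def semantic_signals(embeddings, k=5):
        """Derive global and local semantic information from embeddings.

        embeddings: (n_docs, dim) array of short-text embeddings,
        assumed precomputed (e.g. averaged word vectors).

        Returns:
          global_probs: (n_docs, n_docs) row-wise probability
            distribution over the whole dataset (softmax of cosine
            similarity), the "global semantic information".
          local_dists: (n_docs, k) cosine distances to each text's k
            most similar neighbors, the "local semantic information".
        """
        # Cosine similarity between every pair of documents.
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        unit = embeddings / np.clip(norms, 1e-12, None)
        sim = unit @ unit.T
        np.fill_diagonal(sim, -np.inf)      # exclude self-similarity

        # Global signal: softmax turns each row into a probability
        # distribution over all other documents in the dataset.
        exp = np.exp(sim - sim.max(axis=1, keepdims=True))
        global_probs = exp / exp.sum(axis=1, keepdims=True)

        # Local signal: cosine distance (1 - similarity) to the k
        # nearest neighbors of each document.
        idx = np.argsort(-sim, axis=1)[:, :k]
        local_dists = 1.0 - np.take_along_axis(sim, idx, axis=1)
        return global_probs, local_dists

Such a global distribution could bias long-document assignments toward similar texts while the local distances keep clearly dissimilar texts apart; the exact coupling used by the model is described in the "Incorporating Document Embeddings" section below.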

Models with Auxiliary Information
Models without Auxiliary Information
Model and Inference
Overview
Incorporating Document Embeddings
Inference
Sampling Long Document Assignments l
Sampling Topic Assignments z
DESTM Gibbs Sampling Process
Datasets
Parameter Settings
Topic Evaluation by Topic Coherence
Topic Evaluation by Classification Accuracy
Experimental Results for Complete and Partial Document Embeddings
Semantic Explanations of Topic Demonstrations
Efficiency Analysis
Conclusions