Abstract

Topic modeling is a widely studied foundational problem in text mining. Conventional topic models infer the hidden semantic structure of a corpus from word co-occurrences. However, because short texts are limited in length, data sparsity impedes the inference process of conventional topic models and leads to unsatisfactory results. In fact, each short text usually covers only a small number of topics, and understanding its semantic content requires relevant background knowledge. Motivated by these observations, we propose a regularized non-negative matrix factorization topic model for short texts, named TRNMF. The proposed model leverages pre-trained distributional word representations to overcome the data sparsity problem of short texts. Meanwhile, it employs a clustering mechanism over document-to-topic distributions during topic inference, using the Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM) model. TRNMF successfully integrates both word co-occurrence regularization and sentence similarity regularization into topic modeling for short texts. Extensive experiments on real-world short text corpora show that TRNMF achieves better results than state-of-the-art methods in terms of topic coherence and text classification.

Highlights

  • Recent years have witnessed the increased development and popularity of various kinds of Web applications such as online social networks, recommender systems and Q&A systems

  • In related work, we review two lines of relevant research: 1) topic modeling for short texts, and 2) topic modeling for short texts via word embeddings

  • Liang et al. [39] propose a global and local word embedding-based topic model (GLTM) for short texts, in which the global word embeddings are learned from a large external corpus and the local word embeddings are obtained with the continuous skip-gram model with negative sampling

Summary

INTRODUCTION

Recent years have witnessed the increasing development and popularity of various kinds of Web applications, such as online social networks, recommender systems, and Q&A systems. A simple, method-independent scheme for handling short texts is to leverage an external knowledge base to alleviate data sparseness and uncover the latent semantic information in them. Existing works along this line largely depend on either external thesauri (e.g., WordNet) or lexical knowledge derived from documents in a specific domain (e.g., Wikipedia). The proposed TRNMF extends the non-negative matrix factorization model by introducing topic regularization, learned from a large text corpus in the form of a topic-word distribution, and document regularization, obtained by clustering the short texts. The model leverages global word-word co-occurrence information learned from a large text corpus to alleviate the data sparsity problem, and uses a clustering method to improve the quality of topic inference.
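The summary above describes TRNMF only at a high level, so the following is a minimal illustrative sketch of a regularized NMF of this general shape, not the paper's exact formulation. The objective, the word-similarity matrix `S` (from pre-trained embeddings), the document-cluster similarity matrix `C` (e.g., from GSDMM assignments), the weights `alpha`/`beta`, and the projected-gradient updates are all assumptions made for illustration.

```python
import numpy as np

def trnmf_sketch(X, S, C, k, alpha=0.1, beta=0.1, iters=200, lr=1e-3, seed=0):
    """Illustrative doubly regularized NMF (hypothetical form, not the paper's).

    Minimizes, by projected gradient descent,
        0.5*||X - W H||^2 + 0.5*alpha*||H^T H - S||^2 + 0.5*beta*||W W^T - C||^2
    where
      X : (n_docs, n_words) document-term matrix, factorized as X ~ W @ H
      S : (n_words, n_words) word-similarity matrix from pre-trained embeddings
          (topic regularizer: topics should respect word co-occurrence semantics)
      C : (n_docs, n_docs) document similarity from a clustering step
          (document regularizer: factors should respect cluster structure)
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))          # document-topic factor
    H = rng.random((k, m))          # topic-word factor
    for _ in range(iters):
        R = W @ H - X               # reconstruction residual
        # Gradients of the three quadratic terms (S and C assumed symmetric).
        gW = R @ H.T + 2.0 * beta * (W @ W.T - C) @ W
        gH = W.T @ R + 2.0 * alpha * H @ (H.T @ H - S)
        # Projection onto the non-negative orthant keeps both factors valid.
        W = np.maximum(W - lr * gW, 0.0)
        H = np.maximum(H - lr * gH, 0.0)
    return W, H
```

Multiplicative updates would be the more conventional NMF solver; plain projected gradient is used here only to keep the sketch short and self-contained.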

RELATED WORK
MODELING WORD EMBEDDINGS SEMANTIC MATRIX
MODELING DOCUMENT CLUSTERING MATRIX
UNIFIED SHORT TEXT TOPIC MODEL
EXPERIMENT
EVALUATION BY TOPIC COHERENCE
CONCLUSION