Abstract

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

Highlights

  • We extend the earlier definition of the tensor Ndt,w, the number of times a word w in document d is generated from topic t by the Dirichlet multinomial component of our combined models, which in Section 3.3 refers to the latent feature Latent Dirichlet Allocation (LF-LDA) model and in Section 3.4 to the latent feature Dirichlet Multinomial Mixture (LF-DMM) model

  • We have shown that latent feature representations can be used to improve topic models

  • We proposed two novel latent feature topic models, LF-LDA and LF-DMM, which integrate a latent feature model within the LDA and DMM topic models, respectively (see the sketch below)
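
As a rough illustration of how such a combined model's topic-to-word step could look, the sketch below mixes a Dirichlet multinomial word distribution with a latent-feature distribution obtained by a softmax over dot products between a topic vector and pre-trained word vectors. The mixture weight lam and all variable names are illustrative assumptions, not the paper's exact notation or implementation.

```python
import numpy as np

def combined_word_distribution(phi_t, tau_t, word_vectors, lam):
    """Sketch of a topic-to-word distribution that mixes a Dirichlet
    multinomial component with a latent-feature component.

    phi_t        : (V,)   Dirichlet multinomial word probabilities for topic t
    tau_t        : (K,)   latent feature vector for topic t (assumed name)
    word_vectors : (V, K) pre-trained word vectors, e.g. word2vec or GloVe
    lam          : float  mixture weight between the two components (assumed)
    """
    scores = word_vectors @ tau_t      # dot product of tau_t with each word vector
    scores -= scores.max()             # numerical stability for the softmax
    latent = np.exp(scores)
    latent /= latent.sum()             # latent-feature word distribution over V words
    return lam * latent + (1.0 - lam) * phi_t
```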


Summary

Introduction

Conventional topic modeling algorithms such as these infer document-to-topic and topic-to-word distributions from the co-occurrence of words within documents. In LDA, each document d draws a topic proportion θd ∼ Dir(α), each topic t draws a word distribution φt ∼ Dir(β), and the topic indicator zdi for the ith word wdi in document d is drawn as zdi ∼ Cat(θd), where Dir and Cat stand for a Dirichlet distribution and a categorical distribution. The topic-to-word Dirichlet multinomial component then generates the word wdi by drawing it from the categorical distribution Cat(φzdi) for topic zdi. We follow the Gibbs sampling algorithm for estimating LDA topic models described by Griffiths and Steyvers (2004). By integrating out θ and φ, the algorithm samples the topic zdi for the current ith word wdi in document d from the conditional distribution P(zdi | Z¬di), where Z¬di denotes the topic assignments of all the other words in the document collection D:

P(zdi = t | Z¬di) ∝ (Ndt,¬i + α) · (Ntwdi,¬i + β) / (Nt,¬i + Vβ)

where Ndt,¬i is the number of words in document d assigned to topic t, Ntwdi,¬i is the number of times the word type wdi is assigned to topic t, and Nt,¬i is the total number of words assigned to topic t, all counted excluding the current word wdi; V is the vocabulary size.
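
To make the sampling step concrete, here is a minimal sketch of one collapsed Gibbs sweep for plain LDA using the conditional above; the count-array layout and function names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta):
    """One collapsed Gibbs sampling sweep over all word tokens.

    docs : list of lists of word ids
    z    : list of lists of current topic assignments (same shape as docs)
    n_dt : (D, T) number of tokens in document d assigned to topic t
    n_tw : (T, V) number of times word w is assigned to topic t
    n_t  : (T,)   total number of tokens assigned to topic t
    """
    T, V = n_tw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current token from the counts (the "¬di" part).
            n_dt[d, t_old] -= 1
            n_tw[t_old, w] -= 1
            n_t[t_old] -= 1
            # P(zdi = t | Z¬di) ∝ (Ndt,¬i + α)(Ntw,¬i + β) / (Nt,¬i + Vβ)
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t_new = np.random.choice(T, p=p / p.sum())
            # Add the token back under its newly sampled topic.
            z[d][i] = t_new
            n_dt[d, t_new] += 1
            n_tw[t_new, w] += 1
            n_t[t_new] += 1
```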
