Abstract

Probabilistic topic models are widely used to discover latent topics in document collections, while latent feature vector representations of words have been used to obtain high performance in many NLP tasks. In this paper, we extend two different Dirichlet multinomial topic models by incorporating latent feature vector representations of words trained on very large corpora to improve the word-topic mapping learnt on a smaller corpus. Experimental results show that by using information from the external corpora, our new models produce significant improvements on topic coherence, document clustering and document classification tasks, especially on datasets with few or short documents.

Highlights

  • We extend the earlier definition of the tensor Ndt,w, the number of times a word w in document d is generated from topic t by the Dirichlet multinomial component of our combined models, which in Section 3.3 refers to the latent feature Latent Dirichlet Allocation (LF-LDA) model and in Section 3.4 to the latent feature Dirichlet Multinomial Mixture (LF-DMM) model

  • We have shown that latent feature representations can be used to improve topic models

  • We proposed two novel latent feature topic models, LF-LDA and LF-DMM, which integrate a latent feature model within the LDA and DMM topic models, respectively (see the sketch below)
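
As a rough illustration of how such a combined model's topic-to-word step could look, the sketch below mixes a Dirichlet multinomial word distribution with a latent-feature distribution obtained by a softmax over dot products between a topic vector and pre-trained word vectors. The mixture weight lam and all variable names are illustrative assumptions, not the paper's exact notation or implementation.

```python
import numpy as np

def combined_word_distribution(phi_t, tau_t, word_vectors, lam):
    """Sketch of a topic-to-word distribution that mixes a Dirichlet
    multinomial component with a latent-feature component.

    phi_t        : (V,)   Dirichlet multinomial word probabilities for topic t
    tau_t        : (K,)   latent feature vector for topic t (assumed name)
    word_vectors : (V, K) pre-trained word vectors, e.g. word2vec or GloVe
    lam          : float  mixture weight between the two components (assumed)
    """
    scores = word_vectors @ tau_t      # dot product of tau_t with each word vector
    scores -= scores.max()             # numerical stability for the softmax
    latent = np.exp(scores)
    latent /= latent.sum()             # latent-feature word distribution over V words
    return lam * latent + (1.0 - lam) * phi_t
```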


Summary

Introduction

Conventional topic modeling algorithms such as these infer document-to-topic and topic-to-word distributions from the co-occurrence of words within documents. In LDA, each document d draws a topic proportion θd ∼ Dir(α), each topic t draws a word distribution φt ∼ Dir(β), and the topic indicator zdi for the ith word wdi in document d is drawn as zdi ∼ Cat(θd), where Dir and Cat stand for a Dirichlet distribution and a categorical distribution. The topic-to-word Dirichlet multinomial component then generates the word wdi by drawing it from the categorical distribution Cat(φzdi) for topic zdi. We follow the Gibbs sampling algorithm for estimating LDA topic models described by Griffiths and Steyvers (2004). By integrating out θ and φ, the algorithm samples the topic zdi for the current ith word wdi in document d from the conditional distribution P(zdi | Z¬di), where Z¬di denotes the topic assignments of all the other words in the document collection D:

P(zdi = t | Z¬di) ∝ (Ndt,¬i + α) · (Ntwdi,¬i + β) / (Nt,¬i + Vβ)

where Ndt,¬i is the number of words in document d assigned to topic t, Ntwdi,¬i is the number of times the word type wdi is assigned to topic t, and Nt,¬i is the total number of words assigned to topic t, all counted excluding the current word wdi; V is the vocabulary size.
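
To make the sampling step concrete, here is a minimal sketch of one collapsed Gibbs sweep for plain LDA using the conditional above; the count-array layout and function names are assumptions for illustration, not the authors' code.

```python
import numpy as np

def gibbs_sweep(docs, z, n_dt, n_tw, n_t, alpha, beta):
    """One collapsed Gibbs sampling sweep over all word tokens.

    docs : list of lists of word ids
    z    : list of lists of current topic assignments (same shape as docs)
    n_dt : (D, T) number of tokens in document d assigned to topic t
    n_tw : (T, V) number of times word w is assigned to topic t
    n_t  : (T,)   total number of tokens assigned to topic t
    """
    T, V = n_tw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = z[d][i]
            # Remove the current token from the counts (the "¬di" part).
            n_dt[d, t_old] -= 1
            n_tw[t_old, w] -= 1
            n_t[t_old] -= 1
            # P(zdi = t | Z¬di) ∝ (Ndt,¬i + α)(Ntw,¬i + β) / (Nt,¬i + Vβ)
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + V * beta)
            t_new = np.random.choice(T, p=p / p.sum())
            # Add the token back under its newly sampled topic.
            z[d][i] = t_new
            n_dt[d, t_new] += 1
            n_tw[t_new, w] += 1
            n_t[t_new] += 1
```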
