Abstract

Topic modeling analyzes documents to learn meaningful patterns of words. However, existing topic models fail to learn interpretable topics when working with large and heavy-tailed vocabularies. To address this, we develop the embedded topic model (ETM), a generative model of documents that marries traditional topic models with word embeddings. More specifically, the ETM models each word with a categorical distribution whose natural parameter is the inner product between the word’s embedding and an embedding of its assigned topic. To fit the ETM, we develop an efficient amortized variational inference algorithm. The ETM discovers interpretable topics even with large vocabularies that include rare words and stop words. It outperforms existing document models, such as latent Dirichlet allocation, in terms of both topic quality and predictive performance.
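The word-level distribution described in the abstract can be illustrated with a minimal sketch. It assumes a word-embedding matrix `rho` of shape (V, L) and a single topic embedding `alpha_k` of length L; these names, the toy sizes, and the random values are illustrative assumptions and are not taken from the text. The categorical distribution over the vocabulary is obtained by applying a softmax to the inner products.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()


def etm_word_distribution(rho, alpha_k):
    """Categorical distribution over the vocabulary for one topic.

    rho:     (V, L) array of word embeddings (hypothetical name).
    alpha_k: (L,) embedding of the topic (hypothetical name).
    The natural parameter for each word is the inner product of its
    embedding with the topic embedding.
    """
    logits = rho @ alpha_k
    return softmax(logits)


# Toy example with random embeddings (assumed sizes, for illustration only).
rng = np.random.default_rng(0)
V, L = 6, 4
rho = rng.normal(size=(V, L))
alpha_k = rng.normal(size=L)
beta_k = etm_word_distribution(rho, alpha_k)
```

Words whose embeddings align most closely with the topic embedding receive the highest probability under this sketch, which is one way to read the claim that topics and words share an embedding space.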

Highlights

  • Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents (Blei et al., 2003; Blei, 2012).

  • We develop the embedded topic model (ETM), a document model that marries latent Dirichlet allocation (LDA) and word embeddings.

  • We focus on the continuous bag-of-words (CBOW) variant of word embeddings (Mikolov et al., 2013b).


Summary

Introduction

Topic models are statistical tools for discovering the hidden semantic structure in a collection of documents (Blei et al., 2003; Blei, 2012). Topic models and their extensions have been applied to many fields, such as marketing, sociology, political science, and the digital humanities. Most topic models build on latent Dirichlet allocation (LDA) (Blei et al., 2003). LDA is a hierarchical probabilistic model that represents each topic as a distribution over terms and represents each document as a mixture of the topics. The generative process for each document is the following:

1. Draw topic proportion θ_d ∼ Dirichlet(α_θ)
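As a rough illustration of this generative process, the following sketch samples one document under LDA. Only the first step (drawing the topic proportions) appears in the text above; the per-word steps follow the standard LDA description in Blei et al. (2003), where each word first draws a topic assignment from the proportions and then a term from that topic. The function name, variable names, and toy sizes are assumptions made for illustration.

```python
import numpy as np


def generate_lda_document(beta, alpha, n_words, rng):
    """Sample one document from the LDA generative process.

    beta:  (K, V) array, each row a topic's distribution over terms.
    alpha: (K,) Dirichlet hyperparameter for the topic proportions.
    """
    n_topics, vocab_size = beta.shape
    # Step 1: draw the document's topic proportions.
    theta = rng.dirichlet(alpha)
    # Per word: draw a topic assignment from the proportions...
    z = rng.choice(n_topics, size=n_words, p=theta)
    # ...then draw the word from the assigned topic's term distribution.
    words = np.array([rng.choice(vocab_size, p=beta[k]) for k in z])
    return theta, z, words


# Toy setup (assumed sizes): 3 topics over an 8-term vocabulary.
rng = np.random.default_rng(0)
K, V = 3, 8
beta = rng.dirichlet(np.ones(V), size=K)
alpha = np.full(K, 0.5)
theta, z, words = generate_lda_document(beta, alpha, n_words=20, rng=rng)
```

Here `theta` plays the role of the document's mixture over topics and each row of `beta` is a topic, matching the description of LDA as representing documents as mixtures of topics and topics as distributions over terms.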

