Abstract

Topic modeling is a significant branch of natural language processing and machine learning focused on inferring the generative process of text. Traditionally, algorithms for estimating topic models have relied on Bayesian inference and Gibbs sampling. This paper proposes a novel “acceptable set” framework for formulating topic modeling problems, inspired by ideas from discrete component analysis and data-driven robust optimization. Our approach not only simplifies the design and inference of topic models, but also allows for extensions and generalizations that are challenging to integrate into traditional approaches. Different restrictions (e.g., sparsity) and assumptions (e.g., alternative generative processes) can be easily incorporated into our formulations through additional or modified constraints. Our formulations also naturally control a widely used metric of solution quality, perplexity. We adapt state-of-the-art stochastic gradient methods to find good local optima for the optimization formulations. The algorithms are efficient, scaling to realistic problem sizes with runtimes comparable to existing methods. Through extensive computational experiments, we show that our methods achieve better solution quality than baseline methods and more reliably reconstruct the underlying generative models. Our framework overcomes known vulnerabilities of traditional topic modeling algorithms: our methods are effective in low-data settings, achieve good out-of-sample performance, and perform well across a variety of initial assumptions on input parameter values.
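To make the optimization view of topic modeling concrete, the sketch below casts estimation as likelihood maximization over document-topic and topic-word distributions, solved with a stochastic gradient method, with an entropy penalty standing in for a sparsity constraint and perplexity reported as the quality metric. This is a minimal, hypothetical illustration and not the paper's acceptable-set formulation: the softmax parameterization, the entropy penalty, the Adam optimizer, and all names (`fit_topics`, `counts`, `sparsity`) are assumptions made for the example.

```python
import torch

def fit_topics(counts, n_topics=10, sparsity=0.0, steps=500, lr=0.05, seed=0):
    """counts: dense (n_docs, n_words) term-count tensor."""
    torch.manual_seed(seed)
    n_docs, n_words = counts.shape
    # Unconstrained parameters; a row-wise softmax keeps each row on the probability simplex.
    theta_logits = torch.randn(n_docs, n_topics, requires_grad=True)   # document-topic mixtures
    beta_logits = torch.randn(n_topics, n_words, requires_grad=True)   # topic-word distributions
    opt = torch.optim.Adam([theta_logits, beta_logits], lr=lr)         # stand-in stochastic gradient method
    for _ in range(steps):
        opt.zero_grad()
        theta = torch.softmax(theta_logits, dim=1)
        beta = torch.softmax(beta_logits, dim=1)
        probs = theta @ beta                                  # per-document word probabilities
        nll = -(counts * torch.log(probs + 1e-12)).sum()      # corpus negative log-likelihood
        entropy = -(theta * torch.log(theta + 1e-12)).sum()   # high entropy = diffuse mixtures
        loss = nll + sparsity * entropy                       # penalizing entropy encourages sparse mixtures
        loss.backward()
        opt.step()
    # Perplexity = exp(negative log-likelihood per token), the quality metric cited above.
    perplexity = torch.exp(nll.detach() / counts.sum())
    return theta.detach(), beta.detach(), perplexity.item()
```

A hard constraint (e.g., an explicit bound on perplexity or on the number of active topics per document) could replace the soft penalty above; the penalty form is used here only to keep the sketch self-contained and differentiable end to end.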
