Abstract

Title of dissertation: Models, Inference, and Implementation for Scalable Probabilistic Models of Text

Ke Zhai, Ph.D., 2014

Department of Computer Science

Dissertation directed by: Professor Jordan Boyd-Graber, iSchool, UMIACS

Unsupervised probabilistic Bayesian models are powerful tools for statistical analysis, especially in information retrieval, document analysis, and text processing. Despite their success, these models are often slow in inference because of mutually dependent latent variables, and their parameter spaces are usually very large. As data from diverse media sources (the internet, electronic books, digital films, and so on) become widely accessible, this lack of scalability becomes a critical bottleneck.

The primary focus of this dissertation is to speed up inference in unsupervised probabilistic Bayesian models. There are two common ways to scale an algorithm to large data: parallelization and streaming. The former achieves scalability by distributing the data and the computation across multiple machines. The latter assumes the data arrive in a stream and updates the model incrementally after each observation; it scales to larger datasets because it takes only one pass over the data. In this thesis, we examine both approaches.

We first demonstrate the effectiveness of parallelization on a class of unsupervised Bayesian models, topic models, exemplified by latent Dirichlet allocation (LDA). We propose a fast parallel implementation of variational inference in the MapReduce framework, referred to as Mr. LDA. We further show that our implementation, unlike highly tuned and specialized implementations, is easily extensible.
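The MapReduce decomposition behind such a parallel variational inference can be sketched as follows: each mapper runs the document-level variational update for its shard of documents and emits topic-word sufficient statistics, which a reducer then aggregates into new topic-word parameters. Below is a minimal single-process sketch of that mapper/reducer split; the function names, the digamma approximation, the simplified topic-word term, and the toy data are illustrative assumptions, not the actual Mr. LDA code:

```python
import math
from collections import defaultdict

def digamma(x):
    # Asymptotic approximation of the digamma function psi(x), x > 0
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x \
        - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def map_document(doc, lam, alpha, n_iter=20):
    """Mapper: variational E-step for one document.

    doc -- list of (word, count) pairs
    lam -- current topic-word parameters, {topic: {word: value}}
    Returns the (topic, word) -> expected-count sufficient statistics.
    """
    K = len(lam)
    gamma = [alpha + sum(c for _, c in doc) / K] * K
    for _ in range(n_iter):
        stats = defaultdict(float)
        new_gamma = [alpha] * K
        for w, c in doc:
            # phi_k ∝ exp(E[log theta_dk]) * beta_kw (simplified: the
            # topic-word term uses lam directly rather than its digamma)
            log_phi = [digamma(gamma[k]) + math.log(lam[k].get(w, 1e-10))
                       for k in range(K)]
            m = max(log_phi)
            phi = [math.exp(lp - m) for lp in log_phi]
            s = sum(phi)
            for k in range(K):
                new_gamma[k] += c * phi[k] / s
                stats[(k, w)] += c * phi[k] / s
        gamma = new_gamma
    return stats

def reduce_stats(all_stats, eta=0.01):
    """Reducer: sum sufficient statistics into new topic-word parameters."""
    lam = defaultdict(lambda: defaultdict(lambda: eta))
    for stats in all_stats:
        for (k, w), v in stats.items():
            lam[k][w] += v
    return lam

# Toy corpus with two distinct themes (illustrative data)
corpus = [[("apple", 2), ("banana", 1)], [("car", 2), ("road", 1)]]
lam0 = {0: {"apple": 1.0, "banana": 1.0, "car": 0.1, "road": 0.1},
        1: {"apple": 0.1, "banana": 0.1, "car": 1.0, "road": 1.0}}
# "Map" phase over documents, then "reduce" the emitted statistics
mapped = [map_document(doc, lam0, alpha=0.1) for doc in corpus]
lam1 = reduce_stats(mapped)
```

Because documents are conditionally independent given the topic-word parameters, the map phase parallelizes trivially across machines; only the small per-topic statistics cross the network to the reducers.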
We demonstrate two extensions made possible by this scalable framework: 1) informed priors to guide topic discovery and 2) extracting topics from a multilingual corpus. We show that parallelization enables topic models to handle significantly larger datasets. We further extend multilingual Mr. LDA with tree priors and propose three different inference methods for the latent variables. We examine the effectiveness of these inference methods on the task of machine translation, using the proposed model to extract domain knowledge that considers both the source and target languages. Applied to a large collection of aligned Chinese-English sentences, our model yields significant improvements in BLEU score over strong baselines.

Besides parallelization, another way to address scalability is to learn parameters in an online streaming setting. Although many online algorithms have been proposed for LDA, they all overlook a fundamental but challenging problem: the vocabulary is constantly evolving over time. To address this problem, we propose an online LDA with infinite vocabulary, infvoc LDA. We use online stochastic variational inference and propose heuristics to dynamically order, expand, and contract the set of words in the vocabulary. We show that our algorithm discovers better topics by incorporating new words into the vocabulary and constantly refining the topics over time.

In addition to LDA, we demonstrate the generality of the online stochastic variational inference approach by applying it to adaptor grammars, a broader class of models that subsumes LDA. With appropriate grammar rules, an adaptor grammar reduces exactly to LDA, while offering the flexibility to alter or extend LDA through different rules. We develop a hybrid online inference scheme and show that our method discovers high-quality structure more quickly than both MCMC and variational inference methods.
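At the core of the streaming approaches above is the stochastic natural-gradient update of online variational inference: compute sufficient statistics on a mini-batch, rescale them as if the batch were the whole corpus, and blend them into the current variational parameters with a decaying step size. A minimal sketch of that fixed-vocabulary update follows; the parameter names, prior value, and step-size schedule are illustrative assumptions, not the infvoc LDA implementation:

```python
def learning_rate(t, tau=64.0, kappa=0.7):
    # rho_t = (tau + t)^(-kappa); kappa in (0.5, 1] satisfies the
    # Robbins-Monro conditions, so the stochastic updates can converge.
    return (tau + t) ** (-kappa)

def online_update(lam, batch_stats, num_docs, batch_size, t, eta=0.01):
    """One stochastic step on the topic-word parameters lam (K x V lists).

    batch_stats -- expected topic-word counts from the current mini-batch.
    eta + (num_docs / batch_size) * stats is what the batch-mode update
    would give if this mini-batch were the whole corpus.
    """
    rho = learning_rate(t)
    scale = num_docs / batch_size
    return [[(1.0 - rho) * l + rho * (eta + scale * s)
             for l, s in zip(lam_row, stats_row)]
            for lam_row, stats_row in zip(lam, batch_stats)]

# Two topics over a four-word vocabulary; the mini-batch only saw word 0
lam = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
batch_stats = [[3.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
new_lam = online_update(lam, batch_stats, num_docs=1000, batch_size=10, t=1)
```

Because the step size decays with t, early mini-batches move the parameters substantially while later ones refine them. infvoc LDA additionally reorders, expands, and contracts the vocabulary between updates, which this fixed-vocabulary sketch omits.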
