Abstract

This paper introduces the Poisson-Gamma Latent Dirichlet Allocation (PGLDA) model for capturing word dependencies in topic modeling. The Poisson distribution has been used extensively in the past to model document length, with the expectation that its effect fizzles out by the end of the model definition. This practice often downplays the correlation between words and topics, reducing the precision and accuracy of document retrieval. We therefore propose a new class of model that relaxes the word-independence assumption of the existing Latent Dirichlet Allocation (LDA) model by introducing a Gamma distribution that can capture the correlation between adjacent words in a document. The Poisson document-length distribution and the Gamma correlation distribution are then convolved to form a new mixture distribution for modeling word dependencies. Model parameters are estimated via a Laplacian approximation of the log-likelihood. The new model is evaluated on the 20 Newsgroups and AG's News datasets, with applicability assessed using the F1 score. The results show that PGLDA appreciably outperforms LDA.
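
For intuition about the convolution step, the sketch below gives the standard Poisson-Gamma mixture derivation, assuming the Gamma (with shape a and rate b) is placed on the Poisson rate; the paper's exact parameterization may differ.

    If $N \mid \lambda \sim \mathrm{Poisson}(\lambda)$ and $\lambda \sim \mathrm{Gamma}(a, b)$, then
    \[
      P(N = n) = \int_0^\infty \frac{\lambda^n e^{-\lambda}}{n!} \cdot
                 \frac{b^a}{\Gamma(a)} \lambda^{a-1} e^{-b\lambda} \, d\lambda
               = \frac{\Gamma(n+a)}{n!\,\Gamma(a)}
                 \left(\frac{b}{b+1}\right)^{a} \left(\frac{1}{b+1}\right)^{n},
    \]
    a negative binomial with mean $a/b$ and variance $(a/b)(1 + 1/b)$.

Since the variance-to-mean ratio is $1 + 1/b$, the rate parameter b governs how far the mixture departs from the equidispersed Poisson, which is consistent with b being the pivotal parameter in the highlights below.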

Highlights

  • A topic is defined as a random variable with a unique probability distribution over a fixed vocabulary (Jiang et al., 2015; Wang and Zhang, 2016; Chen, 2017)

  • The main parameter of concern, which determines the validity of Poisson-Gamma Latent Dirichlet Allocation (PGLDA) over Latent Dirichlet Allocation (LDA), is b

  • If the Poisson distribution is adequate for modeling document length, the mean and variance of the number of words per document should be equal (a quick dispersion check is sketched after this list)

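A quick way to probe the last highlight is to compare the mean and variance of observed document lengths. The minimal sketch below uses a hypothetical toy corpus, with NumPy assumed available, in place of the paper's 20 Newsgroups or AG's News data.

    import numpy as np

    def dispersion_index(corpus):
        """Mean, variance, and variance/mean ratio of document lengths."""
        lengths = np.array([len(doc) for doc in corpus])
        mean_len = lengths.mean()
        var_len = lengths.var(ddof=1)
        return mean_len, var_len, var_len / mean_len

    # Hypothetical toy corpus of tokenized documents; substitute real data here.
    rng = np.random.default_rng(0)
    toy_corpus = [["w"] * n for n in rng.negative_binomial(5, 0.05, size=500)]

    mean_len, var_len, d = dispersion_index(toy_corpus)
    print(f"mean={mean_len:.1f}  variance={var_len:.1f}  dispersion={d:.2f}")
    # Dispersion near 1: the Poisson length assumption is reasonable.
    # Well above 1 (overdispersion): a Gamma-mixed Poisson is better motivated.
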


Introduction

A topic is defined as a random variable with a unique probability distribution over a fixed vocabulary (Jiang et al., 2015; Wang and Zhang, 2016; Chen, 2017). A topic is made up of different words in a vocabulary, and a document is made up of several topics. Topic modeling involves working with the N × K matrix of documents and topics and, subsequently, the K × V matrix of topics and words, where N, K, and V are the numbers of documents, topics, and words, respectively (Liu et al., 2016; Zhao et al., 2019). The first step in topic modeling is to define a generative process for simulating documents, as sketched below. LDA and Probabilistic Latent Semantic Analysis (PLSA) are the foundational models in topic modeling, but more refined models have been developed in recent times (Liu et al., 2016). To develop an extended topic model, it is crucial to understand LDA.
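
As an illustration of such a generative process, the minimal sketch below simulates documents from standard LDA; the hyperparameters, corpus dimensions, and Poisson mean are illustrative assumptions rather than settings from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    N, K, V = 100, 5, 1000    # documents, topics, vocabulary size (illustrative)
    alpha, eta = 0.1, 0.01    # Dirichlet hyperparameters (illustrative)

    # K x V matrix: each row is one topic's distribution over the vocabulary
    beta = rng.dirichlet(np.full(V, eta), size=K)

    documents = []
    for _ in range(N):
        theta = rng.dirichlet(np.full(K, alpha))        # N x K side: topic mix per document
        n_words = rng.poisson(80)                       # Poisson-distributed document length
        z = rng.choice(K, size=n_words, p=theta)        # topic assignment for each word
        words = [rng.choice(V, p=beta[k]) for k in z]   # word drawn from its assigned topic
        documents.append(words)

Note that under LDA each word is drawn independently given its topic assignment; PGLDA's contribution is to relax exactly this step so that adjacent words in a document can be correlated.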

