Abstract

Probabilistic latent semantic indexing (PLSI) has been proposed to represent textual documents as mixture proportions of latent topics. Compared to standard latent semantic indexing (LSI), PLSI has a solid statistical foundation. However, the need to fold new documents into the latent topic space led to the definition of PLSI folding-in. Previous studies have shown that, in the case of short queries, the limited vocabulary yields only a small sample of words with non-zero frequencies, so PLSI folding-in tends to produce topic mixtures dominated by a single latent aspect. As a result, folding-in cannot take alternative mappings into account. Bayesian folding-in was therefore introduced: it incorporates the topic mixtures of the known documents, and the mixture proportions of topics for a new document are estimated by maximizing the posterior, with the prior defined as a kernel density estimate based on the Dirichlet distribution. Although Bayesian folding-in outperforms PLSI folding-in, it still has drawbacks, since the Dirichlet distribution has a strictly negative covariance structure that makes it restrictive, especially in the case of count data. To improve on previous work, we propose using the generalized Dirichlet (GD) and Beta-Liouville (BL) distributions as kernel densities in a Bayesian framework for information retrieval.
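
To make the estimation procedure concrete, the Bayesian folding-in objective described above can be written as follows. This is a minimal sketch in standard PLSI notation; the symbols $n(q,w)$, $\boldsymbol\theta_q$, and the exact kernel parameterization are assumptions, since the abstract does not fix a notation:

\[
\hat{\boldsymbol\theta}_q \;=\; \arg\max_{\boldsymbol\theta_q \in \Delta^{K-1}}
\;\underbrace{\sum_{w} n(q,w)\,\log \sum_{k=1}^{K} p(w \mid z_k)\,\theta_{q,k}}_{\text{PLSI folding-in log-likelihood}}
\;+\; \log p(\boldsymbol\theta_q),
\]

where the word-given-topic distributions $p(w \mid z_k)$ are kept fixed from training, and the prior is a kernel density estimate centered at the topic mixtures $\boldsymbol\theta_1,\dots,\boldsymbol\theta_N$ of the $N$ known documents:

\[
p(\boldsymbol\theta_q) \;=\; \frac{1}{N} \sum_{d=1}^{N} \mathcal{K}\!\left(\boldsymbol\theta_q \mid \boldsymbol\theta_d\right).
\]

The difference between the earlier approach and the proposed one lies in the choice of kernel $\mathcal{K}$: a Dirichlet distribution in prior work, versus a generalized Dirichlet or Beta-Liouville distribution here, both of which allow more flexible covariance structures than the Dirichlet's strictly negative one.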
