A Pólya Urn Document Language Model for Improved Information Retrieval

Ronan Cummins, Jiaul H Paik, Yuanhua Lv

doi:10.1145/2746231

Abstract

The multinomial language model has been one of the most effective models of retrieval for more than a decade. However, the multinomial distribution does not model one important linguistic phenomenon relating to term dependency—that is, the tendency of a term to repeat itself within a document (i.e., word burstiness). In this article, we model document generation as a random process with reinforcement (a multivariate Pólya process) and develop a Dirichlet compound multinomial language model that captures word burstiness directly. We show that the new reinforced language model can be computed as efficiently as current retrieval models, and with experiments on an extensive set of TREC collections, we show that it significantly outperforms the state-of-the-art language model for a number of standard effectiveness metrics. Experiments also show that the tuning parameter in the proposed model is more robust than that in the multinomial language model. Furthermore, we develop a constraint for the verbosity hypothesis and show that the proposed model adheres to the constraint. Finally, we show that the new language model essentially introduces a measure closely related to idf, which gives theoretical justification for combining the term and document event spaces in tf-idf type schemes.
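As a rough illustration of the reinforcement idea described in the abstract, the sketch below compares the probability of a term sequence under a simple Pólya urn with its probability under a fixed multinomial. It is not the article's retrieval function. The function names, the toy two-term vocabulary, and the choice to add exactly one unit of mass per draw are all assumptions made for this example.

```python
from fractions import Fraction


def polya_sequence_probability(doc, initial_mass):
    """Probability of drawing the ordered terms in `doc` from a Polya urn.

    Assumption for this sketch: the urn starts with `initial_mass[term]`
    units per term, and one extra unit of the drawn term is added after
    every draw (the reinforcement step).
    """
    mass = {term: Fraction(m) for term, m in initial_mass.items()}
    total = sum(mass.values())
    prob = Fraction(1)
    for term in doc:
        prob *= mass[term] / total  # chance of drawing this term now
        mass[term] += 1             # reinforce the term just drawn
        total += 1
    return prob


def multinomial_sequence_probability(doc, initial_mass):
    """Probability of the same sequence when term probabilities never change."""
    total = sum(Fraction(m) for m in initial_mass.values())
    prob = Fraction(1)
    for term in doc:
        prob *= Fraction(initial_mass[term]) / total
    return prob


if __name__ == "__main__":
    urn = {"a": 1, "b": 1}  # hypothetical two-term vocabulary
    for doc in (["a", "a"], ["a", "b"]):
        print(doc,
              polya_sequence_probability(doc, urn),
              multinomial_sequence_probability(doc, urn))
```

With one unit of mass on each of `a` and `b`, the urn gives the repeated sequence `a a` probability 1/3, against 1/4 under the multinomial, while the mixed sequence `a b` drops to 1/6. This is the burstiness effect in miniature: once a term has occurred, it becomes more likely to occur again.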

Journal: ACM Transactions on Information Systems
Publication Date: May 4, 2015
Citations: 27

Similar Papers

Improving the effectiveness of language modeling approaches to information retrieval
Yuanhua Lv
ACM SIGIR Forum | VOL. 46
21 Dec 2012

Author response: An oscillating computational model can track pseudo-rhythmic speech by using linguistic predictions
Sanne ten Oever ... Andrea E Martin
21 Jun 2021

Hypergeometric language models for republished article finding
Manos Tsagkias ... Wouter Weerkamp
24 Jul 2011

The state of the art in language modeling
Joshua Goodman
01 Jan 2003
