A probabilistic justification for using tf×idf term weighting in information retrieval

Djoerd Hiemstra

doi:10.1007/s007999900025

Abstract

This paper presents a new probabilistic model of information retrieval. Its most important modeling assumption is that documents and queries are defined by an ordered sequence of single terms. This assumption is not made in well-known existing models of information retrieval, but it is essential in the field of statistical natural language processing. The paper uses advances already made in statistical natural language processing to formulate a probabilistic justification for tf×idf term weighting. It shows that the new probabilistic interpretation of tf×idf term weighting might lead to a better understanding of statistical ranking mechanisms, for example by explaining how they relate to coordination level ranking. A pilot experiment on the TREC collection shows that the linguistically motivated weighting algorithm outperforms the popular BM25 weighting algorithm.
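
The abstract does not give the model's formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: a query is scored by the probability of its term sequence, each term probability being a linear interpolation of a document model and a collection model. The function names, the toy data, and the mixing weight `lam` are hypothetical and are not taken from the paper. The second function drops the part of the score that is the same for every document, which leaves a sum over matching query terms of a weight that grows with the term's frequency in the document and shrinks with its frequency in the collection, that is, a tf×idf-like weight.

```python
import math
from collections import Counter


def score_interpolated(query, doc, collection, lam=0.5):
    """Log-probability of the query term sequence under a smoothed document model.

    Assumption (not from the abstract): each term probability is
    lam * P(term | document) + (1 - lam) * P(term | collection).
    Every query term is assumed to occur somewhere in the collection.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(term for d in collection for term in d)
    coll_len = sum(coll_tf.values())
    total = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)
        p_coll = coll_tf[term] / coll_len
        total += math.log(lam * p_doc + (1 - lam) * p_coll)
    return total


def score_matching_terms(query, doc, collection, lam=0.5):
    """Rank-equivalent form that sums weights over matching terms only.

    Dividing each factor by (1 - lam) * P(term | collection) removes a
    document-independent constant, so the ranking is unchanged.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(term for d in collection for term in d)
    coll_len = sum(coll_tf.values())
    total = 0.0
    for term in query:
        if doc_tf[term] == 0:
            continue  # non-matching terms contribute log(1) = 0
        p_doc = doc_tf[term] / len(doc)    # tf-like component
        p_coll = coll_tf[term] / coll_len  # rarer terms give larger weights (idf-like)
        total += math.log(1 + (lam * p_doc) / ((1 - lam) * p_coll))
    return total


# Toy collection, purely illustrative.
docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
query = ["a", "c"]
ranking = sorted(
    range(len(docs)),
    key=lambda i: score_matching_terms(query, docs[i], docs),
    reverse=True,
)
```

Because every matching term adds a strictly positive weight in the second form, a document that matches more query terms tends to score higher, which is one way to see a connection to coordination level ranking.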


Journal: International Journal on Digital Libraries
Publication Date: Aug 1, 2000
Citations: 184

