Power Law Distributions in Information Retrieval

Casper Petersen, Christina Lioma, Jakob Grue Simonsen

doi:10.1145/2816815

Abstract

Several properties of information retrieval (IR) data, such as query frequency or document length, are widely considered to be approximately distributed as a power law. This common assumption is intended to capture specific characteristics of the empirical probability distribution of such data (e.g., its scale-free nature or its long/fat tail). The assumption, however, may not always hold. Motivated by recent work on the statistical treatment of power law claims, we investigate two research questions: (i) to what extent do power law approximations hold for term frequency, document length, query frequency, query length, citation frequency, and syntactic unigram frequency? and (ii) what is the computational cost of replacing ad hoc power law approximations with more accurate distribution fitting? We study 23 TREC and 5 non-TREC datasets and compare the fit of power laws to 15 other standard probability distributions. We find that query frequency and 5 out of 24 term frequency distributions are best approximated by a power law. All remaining properties are better approximated by the Inverse Gaussian, Generalized Extreme Value, Negative Binomial, or Yule distribution. We also find the overhead of replacing power law approximations with more informed distribution fitting to be negligible, with potential gains for IR tasks such as index compression or test collection generation for IR evaluation.
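To make the idea of comparing a power law fit against an alternative distribution concrete, here is a minimal sketch in Python. It is not the authors' procedure: it assumes continuous data, a known lower cutoff `xmin`, and a single alternative (a shifted exponential) rather than the 15 distributions studied in the paper. The function names `fit_power_law`, `fit_exponential`, and `sample_power_law` are hypothetical and introduced only for illustration.

```python
import math
import random


def fit_power_law(xs, xmin):
    """MLE for a continuous power law p(x) = ((a-1)/xmin) * (x/xmin)^(-a), x >= xmin.

    Returns (alpha, log_likelihood). Assumes xmin is given, not estimated.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return alpha, loglik


def fit_exponential(xs, xmin):
    """MLE for a shifted exponential p(x) = lam * exp(-lam * (x - xmin)), x >= xmin.

    Returns (lam, log_likelihood). Used here only as an illustrative alternative.
    """
    tail = [x for x in xs if x >= xmin]
    n = len(tail)
    excess = sum(x - xmin for x in tail)
    lam = n / excess
    loglik = n * math.log(lam) - lam * excess
    return lam, loglik


def sample_power_law(n, alpha, xmin, seed=0):
    """Draw n synthetic samples by inverting the power law CDF (illustration only)."""
    rng = random.Random(seed)
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


if __name__ == "__main__":
    data = sample_power_law(5000, alpha=2.5, xmin=1.0)
    alpha, ll_pl = fit_power_law(data, xmin=1.0)
    lam, ll_exp = fit_exponential(data, xmin=1.0)
    print(f"power law:   alpha={alpha:.3f}, log-likelihood={ll_pl:.1f}")
    print(f"exponential: lam={lam:.3f}, log-likelihood={ll_exp:.1f}")
    # A higher log-likelihood indicates the better-fitting model on this sample.
    print("better fit:", "power law" if ll_pl > ll_exp else "exponential")
```

On real IR data, which are typically discrete counts, a discrete power law and a formal likelihood-ratio or goodness-of-fit test would be more appropriate than this raw log-likelihood comparison.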

Journal: ACM Transactions on Information Systems
Publication Date: Feb 16, 2016
Citations: 46

Similar Papers

Scale collapse and the emergence of the power law species–area relationship
Mark Q Wilber ... John Harte
Global Ecology and Biogeography | VOL. 24
29 Apr 2015

On the universality of power laws for tokamak plasma predictions
J Garcia ... D Cambon
Plasma Physics and Controlled Fusion | VOL. 60
09 Jan 2018

New Fracturing Fluid Viscosity Model to Cure Power Law Mistakes
Denis Vernigora ...
26 Oct 2020

Scaling
Stefan Thurner ... Peter Klimek
22 Nov 2018
