Term Frequency Normalization Research Articles

Abstract: Many well‐known probabilistic information retrieval models have shown promise for document ranking, especially BM25. Nevertheless, the control parameters in BM25 usually need to be adjusted to achieve improved performance on different data sets; additionally, the bag‐of‐words assumption in BM25 prevents it from directly using the rich information that lies at the sentence or document level. Motivated by these challenges, we first propose a new normalization method for the term frequency in BM25 (called BM25QL in this paper); in addition, the method is incorporated into CRTER2, a recent BM25‐based model, to construct CRTER2QL. Then, we incorporate topic modeling and word embedding into BM25 to relax the bag‐of‐words assumption. In this direction, we propose a topic‐based retrieval model, TopTF, for BM25, which is then further incorporated into the language model (LM) and the multiple aspect term frequency (MATF) model. Furthermore, an enhanced topic‐based term frequency normalization framework, ETopTF, based on embedding is presented. Experimental studies demonstrate the effectiveness of these methods. Specifically, on all tested data sets and in terms of the mean average precision (MAP), our proposed models, BM25QL and CRTER2QL, are comparable to BM25 and CRTER2 with the best b parameter value; the TopTF models significantly outperform the baselines, and the ETopTF models further improve on TopTF in terms of the MAP.
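For context, the b parameter mentioned in the abstract controls how strongly the standard BM25 term frequency component is penalized by document length. The following is a minimal sketch of that standard component only, not of BM25QL, CRTER2QL, TopTF, or ETopTF, whose formulas are not given here. The function name `bm25_tf` and the default values k1 = 1.2 and b = 0.75 are illustrative assumptions.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Length-normalized term frequency component of standard BM25.

    tf          -- raw count of the term in the document
    doc_len     -- document length in tokens
    avg_doc_len -- average document length in the collection
    k1, b       -- control parameters (defaults are common choices,
                   assumed here for illustration)
    """
    # b = 0 disables length normalization; b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


if __name__ == "__main__":
    # The same raw count is worth less in a longer document when b > 0.
    for b in (0.0, 0.5, 1.0):
        short = bm25_tf(3, doc_len=50, avg_doc_len=100, b=b)
        long_ = bm25_tf(3, doc_len=200, avg_doc_len=100, b=b)
        print(f"b={b}: short doc {short:.3f}, long doc {long_:.3f}")
```

Because the best value of b varies between collections, a normalization that does not require tuning it, as the abstract claims for BM25QL, removes one per-collection tuning step.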

The standard approach for term frequency normalization is based only on the document length. However, it does not distinguish verbosity from scope, the two main factors that determine document length. Because verbosity and scope have largely different effects on the increase in term frequency, the standard approach can easily suffer from insufficient or excessive penalization depending on the specific type of long document. To overcome these problems, this article proposes two-stage normalization, which performs verbosity and scope normalization separately and employs a different penalization function for each. In verbosity normalization, each document is prenormalized by dividing the term frequency by the verbosity of the document. In scope normalization, an existing retrieval model is applied in a straightforward manner to the prenormalized document, which leads to the proposed verbosity normalized (VN) retrieval model. Experiments on standard TREC collections demonstrate that the VN model yields marginal but statistically significant improvements over standard retrieval models.
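The two stages described above can be sketched as follows. The abstract does not define verbosity or name the retrieval model used in the second stage, so this sketch makes two assumptions: verbosity is taken as document length divided by the number of distinct terms, and the scope stage reuses a BM25-style saturation. All function names are hypothetical.

```python
from collections import Counter


def verbosity(tokens):
    # Assumption: verbosity = document length / number of distinct terms.
    return len(tokens) / len(set(tokens))


def prenormalize(tokens):
    """Stage 1: divide each raw term frequency by the document's verbosity."""
    if not tokens:
        return {}
    v = verbosity(tokens)
    return {term: tf / v for term, tf in Counter(tokens).items()}


def scope_normalized_tf(pre_tf, avg_scope, k1=1.2, b=0.75):
    """Stage 2: apply an existing model to the prenormalized document.

    Assumption: a BM25-style component is used, with the prenormalized
    length (sum of prenormalized frequencies) in place of the raw length.
    """
    scope = sum(pre_tf.values())
    length_norm = 1.0 - b + b * (scope / avg_scope)
    return {t: f * (k1 + 1.0) / (f + k1 * length_norm)
            for t, f in pre_tf.items()}


if __name__ == "__main__":
    concise = ["term", "frequency"]
    verbose = concise * 2  # same scope, every term repeated
    # After stage 1 both documents carry the same frequencies, so the
    # repetition alone no longer changes the score.
    print(prenormalize(concise))
    print(prenormalize(verbose))
    print(scope_normalized_tf(prenormalize(verbose), avg_scope=2.0))
```

Under these assumptions, a document that only repeats its content is neutralized in the first stage, while a document that covers more distinct terms is penalized in the second.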

Articles published on Term Frequency Normalization

A topic‐based term frequency normalization framework to enhance probabilistic information retrieval

Two-Stage Document Length Normalization for Information Retrieval

Term Importance Degree Impact on Search Result Clustering

Adaptive Query Term Weight Based on Cloud Model for BM25

A nonparametric term weighting method for information retrieval based on measuring the divergence from independence

On setting the hyper-parameters of term frequency normalization for information retrieval

Probabilistic models of information retrieval based on measuring the divergence from randomness
