A set of novel HTML document quality features for Web information retrieval: Including applications to learning to rank for information retrieval

Ahmet Aydın, Ahmet Arslan, Bekir Taner Dinçer

doi:10.1016/j.eswa.2024.123177

Abstract

Past work on Information Retrieval (IR) over web document collections shows that incorporating a measure of web document quality, that is, a document prior (e.g., PageRank), into an IR system improves retrieval effectiveness. In this study, we introduce new document priors and empirically investigate their effect by employing them as features in a learning to rank (LTR) deployment. The experiments are performed on two standard Web IR test collections, the ClueWeb09 and ClueWeb12 datasets, which include 500 and 733 million web documents, respectively, together with the associated TREC and NTCIR query sets, which total 1,204 queries. A strong baseline is formed from standard features introduced in previous work, and the effect of the features newly introduced in this paper is measured against it. We test our features with LambdaMART, a state-of-the-art LTR technique. The results reveal that the features introduced in this work lead to an improvement in retrieval performance on the test collections in use. The introduced features are classified into five groups with respect to their functional properties, and each group is also analyzed in detail.
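To illustrate what "employing document priors as features" can look like in practice, here is a minimal sketch, assuming Python and only its standard library. The three features (`text_to_markup_ratio`, `link_count`, `tag_count`) and the helper names `html_prior_features` and `augment` are hypothetical illustrations; they are not the features proposed in the paper, which the abstract does not enumerate. The sketch only shows that such priors are query-independent values computed from the HTML alone and appended to an existing baseline feature vector.

```python
from html.parser import HTMLParser


class QualityFeatureParser(HTMLParser):
    """Collects simple, query-independent counts from an HTML document."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.link_count = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.link_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Count only visible text, ignoring script and style bodies.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def html_prior_features(html):
    """Return hypothetical document-prior features for one HTML string."""
    parser = QualityFeatureParser()
    parser.feed(html)
    parser.close()
    total_chars = max(len(html), 1)  # guard against empty input
    return {
        "text_to_markup_ratio": parser.text_chars / total_chars,
        "link_count": float(parser.link_count),
        "tag_count": float(parser.tag_count),
    }


def augment(baseline_features, html):
    """Append the prior features (in sorted key order) to a baseline vector."""
    priors = html_prior_features(html)
    return list(baseline_features) + [priors[k] for k in sorted(priors)]
```

The augmented vector would then be passed, per query-document pair, to an LTR learner such as a LambdaMART implementation; that training step is left out here because it depends on the library chosen.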

Similar Papers

Document retrieval experiments using cluster analysis
Jack Minker ... Gerald A Wilson
Journal of the American Society for Information Science | VOL. 24
01 Jul 1973

A late fusion approach to cross-lingual document re-ranking
Dong Zhou ... Vincent Wade
26 Oct 2010

“A term is known by the company it keeps”: On Selecting a Good Expansion Set in Pseudo-Relevance Feedback
Raghavendra Udupa ... Abhijit Bhole
01 Jan 2009

Homonymy and polysemy in information retrieval
Robert Krovetz
01 Jan 1997
