Abstract

A good search engine aims to place the most relevant documents at the top of the result list. This paper describes a new technique called "Improving search engines by demoting non-relevant documents" (DNR) that improves precision by detecting and demoting non-relevant documents. DNR generates a new set of queries composed of the terms of the original query combined in different ways. The documents retrieved by those new queries are evaluated using a heuristic algorithm to detect the non-relevant ones. These non-relevant documents are moved down the list, which consequently improves precision. The new technique is tested on the WT2g test collection using several retrieval models: the vector model based on the TFIDF weighting measure, and the probabilistic models based on the BM25 and DFR-BM25 weighting measures. Recall and precision ratios are used to compare the performance of the new technique against that of the original query.
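The two mechanical steps named in the abstract, building new queries from the original query terms and moving flagged documents down the list, can be illustrated with a minimal sketch. The abstract does not specify how terms are combined or how the heuristic decides which documents are non-relevant, so everything here is an assumption: `query_variants`, `demote`, and the `min_terms` parameter are hypothetical names, term subsets stand in for "combined in different ways", and the set of flagged documents is simply passed in.

```python
from itertools import combinations


def query_variants(query, min_terms=2):
    """Build sub-queries from combinations of the original query terms.

    Assumption: variants are plain term subsets of size >= min_terms;
    the paper's actual combination scheme is not given in the abstract.
    """
    terms = query.split()
    variants = []
    for size in range(min_terms, len(terms) + 1):
        for combo in combinations(terms, size):
            variants.append(" ".join(combo))
    return variants


def demote(ranked_docs, flagged):
    """Move flagged documents to the end of the ranking.

    The relative order inside each group is preserved. How documents get
    flagged (the heuristic) is outside this sketch.
    """
    kept = [doc for doc in ranked_docs if doc not in flagged]
    moved = [doc for doc in ranked_docs if doc in flagged]
    return kept + moved


if __name__ == "__main__":
    print(query_variants("web search engine"))
    print(demote(["d1", "d2", "d3", "d4"], {"d2"}))
```

Keeping the reordering stable means that documents not flagged as non-relevant retain the ranking produced by the underlying retrieval model.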

Highlights

  • Search engines extract user-specified information from documents and files, ranging from books to online blogs, journals, and academic articles [1]

  • The new technique is tested on the WT2g test collection using the vector model [9,10,11,12] based on the TFIDF weighting measure [13,14] and the probabilistic models [15] based on the Best Match 25 (BM25) and DFR-BM25 weighting measures [16,17,18]

  • When DNR was tested with the probabilistic model based on the BM25 weighting measure [18], it classified 3631 non-relevant documents as non-relevant

Summary

INTRODUCTION

Search engines extract user-specified information from documents and files, ranging from books to online blogs, journals, and academic articles [1]. Search engines cannot be 100% accurate because document relevance is subjective and depends on the user's judgment, which in turn depends on many factors such as the user's knowledge of the topic, the reason for searching, and satisfaction with the returned results [3]. There are many challenges involved in making a search engine successful [2,4]. These challenges include acquiring many relevant documents from many sources, extracting useful representations of the documents to facilitate search, ranking documents in response to a user request, and presenting the search results effectively by placing the most relevant documents at the top of the list [5,6,7]. Recall and precision ratios are used to compare the performance of the new technique against that of the original query.
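As a rough illustration of why pushing a non-relevant document down the list raises precision near the top, the sketch below computes precision and recall over the first k results. The function names, document identifiers, and relevance judgments are made up for the example, and the cutoff-based definitions are an assumption; they are not figures or procedures taken from the paper.

```python
def precision_at_k(ranked_docs, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_docs[:k]
    if not top:
        return 0.0
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall_at_k(ranked_docs, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked_docs[:k] if doc in relevant)
    return hits / len(relevant)


if __name__ == "__main__":
    relevant = {"d1", "d3", "d4"}              # hypothetical judgments
    original = ["d1", "d2", "d3", "d4"]        # d2 is non-relevant
    reordered = ["d1", "d3", "d4", "d2"]       # d2 moved to the end

    print(precision_at_k(original, relevant, 2))   # 0.5
    print(precision_at_k(reordered, relevant, 2))  # 1.0
```

In this toy ranking, moving the single non-relevant document from position 2 to position 4 lifts precision over the top two results from 0.5 to 1.0, while the set of retrieved documents stays the same.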

Vector Model
Probabilistic
WEIGHTING TERMS
DFR-BM25
TEST COLLECTION
ASSESSMENT
THE NEW TECHNIQUE
EXPERIMENTS AND RESULTS
Using the vector model based on TFIDF
Using the probabilistic model based on BM25
97 Relevant Rejected : 526
Using the probabilistic model based on DFR-BM25
93 Relevant Rejected : 533
CONCLUSIONS
