Abstract

Query expansion is an important task in information retrieval: it refines the user query and helps retrieve relevant documents. In this paper, an N-gram Thesaurus is constructed from the documents for query expansion. The HTML tags in web documents are considered and their syntactic context is analyzed. Each tag is assigned a suitable weight based on its nature, properties and significance. The weight of each term is then calculated from the corresponding tag weight and the term frequency, and stored in the inverted index. All single terms in the inverted index are added to the Thesaurus as unigrams, and bigrams are constructed from the unigrams. Likewise, (N + 1)-grams are generated from the N-grams and their weights and added to the Thesaurus. During the query session, the user's query terms are expanded using the N-grams predicted by the Thesaurus, which are offered to the user as suggestions. The performance of the proposed approach is evaluated on the ClueWeb09B, WT10g and GOV2 benchmark datasets. Using the improvement gain over the baseline as the evaluation measure, the proposed approach achieved gains of 7.9% on ClueWeb09B, 18.3% on WT10g and 29.4% on GOV2 in terms of Mean Average Precision (MAP). We also compared the proposed approach with two other query expansion approaches, KLDCo and BoCo. The approach achieved 0.574 (+0.236), 0.519 (+0.209), 0.422 (+0.185) and 0.654 (+0.243) in terms of P@5, P@10, MAP and MRR, respectively, with the gain against the baselines given in parentheses.
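
The pipeline described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the tag weights or the rule for combining N-gram weights, so the values in `TAG_WEIGHTS`, the function names `build_thesaurus` and `suggest`, and the choice to score every N-gram as the sum of the weights of the tags it occurs in (tag weight times frequency) are all assumptions made for illustration. Documents are assumed to be already parsed into (tag, text) pairs.

```python
from collections import defaultdict

# Hypothetical tag weights (assumption): the abstract says tags are weighted
# by significance but does not list the actual values.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "b": 2.0, "p": 1.0}


def build_thesaurus(docs, max_n=3):
    """Build {n: {ngram_tuple: weight}} from docs given as lists of (tag, text).

    Assumed scoring: each occurrence of an n-gram adds the weight of the tag
    it appears in, so the total is tag weight times frequency, summed over tags.
    """
    thesaurus = {n: defaultdict(float) for n in range(1, max_n + 1)}
    for doc in docs:
        for tag, text in doc:
            weight = TAG_WEIGHTS.get(tag, 1.0)  # unknown tags get weight 1.0
            terms = text.lower().split()
            for n in range(1, max_n + 1):
                # Slide a window of n adjacent terms over the tag's text.
                for i in range(len(terms) - n + 1):
                    thesaurus[n][tuple(terms[i:i + n])] += weight
    return thesaurus


def suggest(thesaurus, query, k=3):
    """Return up to k (N + 1)-grams that extend the N-term query, best first."""
    prefix = tuple(query.lower().split())
    candidates = thesaurus.get(len(prefix) + 1, {})
    matches = [(g, w) for g, w in candidates.items() if g[:len(prefix)] == prefix]
    # Sort by descending weight, breaking ties alphabetically for determinism.
    matches.sort(key=lambda gw: (-gw[1], gw[0]))
    return [" ".join(g) for g, _ in matches[:k]]


docs = [[
    ("title", "query expansion methods"),
    ("p", "query expansion improves retrieval and query logs help"),
]]
thesaurus = build_thesaurus(docs)
print(suggest(thesaurus, "query"))            # ['query expansion', 'query logs']
print(suggest(thesaurus, "query expansion"))  # ['query expansion methods', 'query expansion improves']
```

In this toy input, "query expansion" outranks "query logs" because it also occurs in the title, which carries the largest assumed weight.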
