Abstract

Ad hoc information retrieval (ad hoc IR) is a challenging task that consists of ranking text documents for bag-of-words (BOW) queries. Classic approaches based on query and document text vectors use term-weighting functions to rank the documents. A key limitation of these methods is their inability to handle polysemous terms; in addition, they introduce spurious orthogonality between semantically related words. To address these limitations, model-based IR approaches built on topics have been explored. Specifically, topic models based on Latent Dirichlet Allocation (LDA) allow text documents to be represented in a latent topic space, better modeling polysemy and avoiding orthogonal representations of related terms. We extend LDA-based IR strategies using different ensemble strategies. Model selection follows the ensemble learning paradigm, for which we test two approaches widely and successfully used in supervised learning: we study Boosting and Bagging techniques for topic models, using each model as a weak IR expert. We then merge the ranking lists obtained from each model using a simple but effective top-k list fusion approach. We show that our proposal strengthens results in precision and recall, outperforming classic IR models and strong baselines based on topic models.
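As a concrete illustration of the final ensemble step described above, the sketch below merges the top-k ranked lists produced by several independently trained topic models (the weak IR experts) into a single ranking. The abstract only states that a simple top-k list fusion approach is used, so the Borda-style scoring rule, the function name fuse_top_k, and the example data are assumptions for illustration.

```python
from collections import defaultdict

def fuse_top_k(ranked_lists, k=100):
    """Merge ranked document lists from several weak IR experts.

    ranked_lists: list of lists, each holding document ids ordered by
                  relevance according to one topic model (one expert).
    k:            number of top documents taken from each list.

    Fusion rule (illustrative Borda-style count): a document at rank r
    in an expert's top-k list receives k - r points; points are summed
    across experts and the fused list is sorted by total score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking[:k]):
            scores[doc_id] += k - rank
    return sorted(scores, key=scores.get, reverse=True)

# Example: three experts (e.g., bagged LDA models) ranking five documents.
experts = [
    ["d3", "d1", "d4", "d2", "d5"],
    ["d1", "d3", "d2", "d5", "d4"],
    ["d3", "d2", "d1", "d4", "d5"],
]
print(fuse_top_k(experts, k=5))  # ['d3', 'd1', 'd2', 'd4', 'd5']
```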

Highlights

  • We introduce the background knowledge necessary to present our proposal

  • We compare the performance of our three methods: LDA Ens, BAGG Ens, and ADA

  • To illustrate the differences between the four topic-model-based methods, we compare the top-5 words of the highly coherent topics detected by LDA in each dataset


Summary

Introduction

We introduce the background knowledge necessary to present our proposal. The setting for this work is the ad hoc IR method proposed by Wei and Croft [19], which extends the query likelihood model using topic models. Formally, let C be a text corpus. Each document di ∈ C is represented by a topic distribution Θdi = {θdi,1, θdi,2, …, θdi,K}, where K is the number of topics. The topic model provides a probability distribution φj over the words for each topic j, and the topic model of C corresponds to the collection of topics Φ = {φ1, φ2, …, φK}.
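To make the ranking rule concrete, the sketch below writes out the query likelihood score using this notation. The topic-based word probability follows directly from Θdi and Φ; the final line, which interpolates it with a smoothed unigram document model Plm via a mixing weight λ, reflects our reading of the Wei and Croft LDA-based document model and should be treated as an assumption, with λ and Plm being symbols introduced here rather than taken from the text.

```latex
% Query likelihood: rank document d_i by the probability of generating query q
P(q \mid d_i) = \prod_{w \in q} P(w \mid d_i)

% Topic-model component, built from \Theta_{d_i} and \Phi
P_{\text{lda}}(w \mid d_i) = \sum_{j=1}^{K} \theta_{d_i,j}\, \varphi_{j,w}

% Assumed combination with a smoothed unigram model P_{\text{lm}} (mixing weight \lambda)
P(w \mid d_i) = \lambda\, P_{\text{lm}}(w \mid d_i) + (1 - \lambda)\, P_{\text{lda}}(w \mid d_i)
```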
