Abstract

In recent years, topic modeling has gained significant momentum in information retrieval (IR). Researchers have found that combining the topic information generated by topic modeling with traditional TF-IDF information yields superior results in document retrieval. However, applying this idea to real-life IR systems requires solving two critical problems: how to store the topic information, and how to use it together with the TF-IDF information for efficient document retrieval. In this paper, we propose the Topic Enhanced Inverted Index (TEII), which incorporates topic information into the inverted index for efficient top-k document retrieval. Specifically, we explore two types of TEII. We first propose the incremental TEII, which adds topic information to the traditional inverted index in the form of topic-based inverted lists. The incremental TEII is beneficial for legacy IR systems, since it does not change the existing TF-IDF-based inverted lists. As a more flexible alternative, we propose the hybrid TEII, which incorporates the topic information into each posting of the inverted index. For the hybrid TEII, we propose two relaxation methods that support dynamic estimation of the upper-bound impact of each posting. The hybrid TEII is highly extensible with respect to additional ranking factors, and we demonstrate this with an extension that takes into account the static quality of the documents in the corpus. Based on the incremental and hybrid TEIIs, we develop several query processing algorithms that support efficient top-k document retrieval. Empirical evaluation on the TREC dataset verifies the effectiveness and efficiency of the proposed index structures and query processing algorithms.
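
To make the incremental TEII idea concrete, the following is a minimal Python sketch and not the paper's implementation. It keeps the TF-IDF inverted lists as they are and adds a second set of topic-based inverted lists beside them. The function names, the input layout, the linear mixing weight `lam`, and the exhaustive (non-pruned) scoring are all assumptions made for illustration; the abstract does not specify the scoring function or the query processing algorithms.

```python
import heapq
from collections import defaultdict


def build_incremental_teii(term_weights, topic_weights):
    """Build term-based and topic-based inverted lists.

    term_weights:  {doc_id: {term: tfidf_weight}}   (hypothetical input layout)
    topic_weights: {doc_id: {topic_id: probability}} (hypothetical input layout)
    """
    term_lists = defaultdict(list)   # existing TF-IDF lists, left unchanged
    topic_lists = defaultdict(list)  # additional lists keyed by topic id
    for doc_id, weights in term_weights.items():
        for term, weight in weights.items():
            term_lists[term].append((doc_id, weight))
    for doc_id, weights in topic_weights.items():
        for topic, prob in weights.items():
            topic_lists[topic].append((doc_id, prob))
    return term_lists, topic_lists


def top_k(term_lists, topic_lists, query_terms, query_topics, k, lam=0.5):
    """Score documents by an assumed linear mix of TF-IDF and topic evidence.

    query_topics: {topic_id: probability} for the query.
    This scans every relevant posting; it does no early termination.
    """
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in term_lists.get(term, []):
            scores[doc_id] += (1 - lam) * weight
    for topic, query_prob in query_topics.items():
        for doc_id, doc_prob in topic_lists.get(topic, []):
            scores[doc_id] += lam * query_prob * doc_prob
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


if __name__ == "__main__":
    terms = {"d1": {"index": 2.0}, "d2": {"index": 1.0}}
    topics = {"d1": {0: 0.1}, "d2": {0: 0.9}}
    term_lists, topic_lists = build_incremental_teii(terms, topics)
    print(top_k(term_lists, topic_lists, ["index"], {0: 1.0}, k=1))
```

Because the topic lists live in a separate structure, the term lists stay byte-for-byte what a legacy system already has, which is the property the abstract highlights for the incremental variant.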
