Ranked Document Retrieval in External Memory

Rahul Shah, Cheng Sheng, Jeffrey Scott Vitter, Sharma V Thankachan

doi:10.1145/3559763

Abstract

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T_1, T_2, …, T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log^*(n/B)) I/Os, where B is the block size. The second one takes O(n log^*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
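To make the query semantics concrete, the following is a minimal brute-force sketch of a top-k query, not any of the data structures described above. The abstract does not fix a relevance measure, so the sketch assumes term frequency (the number of occurrences of P in a document). It also assumes that documents not containing P are never reported and that ties are broken by smaller identifier. The names `count_occurrences` and `top_k_documents` are hypothetical and chosen only for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: list[str], pattern: str, k: int) -> list[int]:
    """Return up to k document identifiers (0-based list indices), sorted by
    decreasing relevance to pattern.

    Assumption: relevance is term frequency; the abstract leaves the
    relevance function unspecified. This scans every document, so it costs
    time linear in n per query, unlike the indexed structures discussed above.
    """
    if not pattern:
        raise ValueError("pattern must have length p >= 1")
    # Score every document and keep only those that contain the pattern.
    scored = []
    for doc_id, doc in enumerate(documents):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, doc_id))
    # Highest frequency first; ties broken by smaller identifier (assumption).
    best = heapq.nlargest(k, scored, key=lambda item: (item[0], -item[1]))
    return [doc_id for _, doc_id in best]
```

For example, with the documents `["banana", "ananas", "bandana", "cherry"]` and the pattern `"ana"`, the first two documents each contain two (overlapping) occurrences and the third contains one, so a query with k = 2 reports identifiers 0 and 1.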

Full Text