Let $$\mathcal{D} = \{\mathsf{T}_1, \mathsf{T}_2, \ldots, \mathsf{T}_D\}$$ be a collection of D string documents of n characters in total, drawn from an alphabet $$\varSigma = [\sigma]$$. The top-k document retrieval problem is to preprocess $$\mathcal{D}$$ into a data structure that, given a query $$(P[1 \ldots p], k)$$, returns the k documents of $$\mathcal{D}$$ most relevant to the pattern P. Relevance is captured by a predefined ranking function, which depends on the set of occurrences of P in $$\mathsf{T}_d$$. For example, it can be the term frequency (i.e., the number of occurrences of P in $$\mathsf{T}_d$$), the term proximity (i.e., the distance between the closest pair of occurrences of P in $$\mathsf{T}_d$$), or a pattern-independent importance score of $$\mathsf{T}_d$$ such as PageRank. Linear-space solutions with optimal query time already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. Space-efficient data structures for term-proximity-based retrieval, however, have remained elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only o(n) bits on top of any compressed suffix array of $$\mathcal{D}$$ and solves queries in $$O((p+k)\,\mathrm{polylog}\, n)$$ time. We also show that scores consisting of a weighted combination of term proximity, term frequency, and document importance can be handled using twice the space required to represent the text collection.
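To make the ranking functions concrete, the following is a minimal brute-force sketch of the query semantics only. It scans every document and is not the data structure presented in the paper. The function names (`occurrences`, `term_frequency`, `term_proximity`, `top_k`) are hypothetical. It also assumes that the distance between two occurrences is the difference of their starting positions, and that ties are broken by document index.

```python
def occurrences(text, pattern):
    """Starting positions of all (possibly overlapping) occurrences of pattern."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(text, pattern):
    """Number of occurrences of pattern in text (higher is more relevant)."""
    return len(occurrences(text, pattern))


def term_proximity(text, pattern):
    """Distance between the closest pair of occurrences (lower is more relevant).

    Assumption: distance is the difference of starting positions, and a
    document with fewer than two occurrences gets an infinite distance.
    """
    pos = occurrences(text, pattern)
    if len(pos) < 2:
        return float("inf")
    # Positions are sorted, so the closest pair is always adjacent.
    return min(b - a for a, b in zip(pos, pos[1:]))


def top_k(documents, pattern, k, relevance):
    """Indices of the k most relevant documents that contain pattern.

    `relevance(text, pattern)` must return a value where larger means more
    relevant; ties are broken by the smaller document index (an assumption).
    """
    candidates = [i for i, doc in enumerate(documents) if pattern in doc]
    candidates.sort(key=lambda i: (-relevance(documents[i], pattern), i))
    return candidates[:k]


docs = ["abracadabra", "banana", "xxabxxxxab", "ab"]
by_frequency = top_k(docs, "ab", 2, term_frequency)
# Negate the distance so that a smaller distance means a higher relevance.
by_proximity = top_k(docs, "ab", 2, lambda t, p: -term_proximity(t, p))
```

This scan takes time proportional to the total collection size per query, which is the cost that an index built on a compressed suffix array is meant to avoid.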