Abstract

Let $$\mathcal{D} = \{\mathsf{T}_1, \mathsf{T}_2, \ldots, \mathsf{T}_D\}$$ be a collection of D string documents of n characters in total, drawn from an alphabet set $$\varSigma = [\sigma]$$. The top-k document retrieval problem is to preprocess $$\mathcal{D}$$ into a data structure that, given a query $$(P[1 \ldots p], k)$$, can return the k documents of $$\mathcal{D}$$ most relevant to the pattern P. Relevance is captured by a predefined ranking function, which depends on the set of occurrences of P in $$\mathsf{T}_d$$. For example, it can be the term frequency (i.e., the number of occurrences of P in $$\mathsf{T}_d$$), the term proximity (i.e., the distance between the closest pair of occurrences of P in $$\mathsf{T}_d$$), or a pattern-independent importance score of $$\mathsf{T}_d$$ such as PageRank. Linear-space solutions with optimal query time already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space-efficient data structures for term-proximity-based retrieval have been elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only $$o(n)$$ bits on top of any compressed suffix array of $$\mathcal{D}$$ and solves queries in $$O((p+k)\,\mathrm{polylog}\, n)$$ time. We also show that scores consisting of a weighted combination of term proximity, term frequency, and document importance can be handled using twice the space required to represent the text collection.
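
To make the problem statement concrete, the following is a minimal brute-force sketch of top-k retrieval under the two pattern-dependent ranking functions defined above. It is not the data structure presented in the paper: it scans every document at query time and uses no index or compression. The function names (`occurrences`, `term_frequency`, `term_proximity`, `top_k`) are hypothetical, and two choices are assumptions made for illustration only: occurrences may overlap, and a document with a single occurrence is assigned infinite proximity so that it ranks last.

```python
import heapq
from math import inf


def occurrences(text, pattern):
    """Return the start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(positions):
    """Number of occurrences of the pattern in one document."""
    return len(positions)


def term_proximity(positions):
    """Distance between the closest pair of occurrences.

    Assumption: with fewer than two occurrences the proximity is treated
    as infinite, so such documents rank after all others.
    """
    if len(positions) < 2:
        return inf
    # Positions are sorted, so the closest pair is always adjacent.
    return min(b - a for a, b in zip(positions, positions[1:]))


def top_k(docs, pattern, k, by="frequency"):
    """Return the indices of the k most relevant documents (naive scan)."""
    scored = []
    for d, text in enumerate(docs):
        pos = occurrences(text, pattern)
        if not pos:
            continue  # documents without the pattern are never reported
        if by == "frequency":
            key = -term_frequency(pos)  # higher frequency is more relevant
        else:
            key = term_proximity(pos)   # smaller distance is more relevant
        scored.append((key, d))
    # Ties are broken by document index because tuples compare element-wise.
    return [d for _, d in heapq.nsmallest(k, scored)]


docs = ["aa", "a_a_a_a", "a____a", "b"]
by_frequency = top_k(docs, "a", 2, by="frequency")
by_proximity = top_k(docs, "a", 2, by="proximity")
```

On this toy collection the two rankings disagree on the best document: the second document has the most occurrences, while the first has the closest pair. The scan costs time proportional to the total collection size per query, which is the cost that indexed solutions with query time depending only on p and k avoid.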
