Abstract

Let \(\mathcal{D} = \{\mathsf{T}_1, \mathsf{T}_2, \dots, \mathsf{T}_D\}\) be a collection of \(D\) string documents of \(n\) characters in total, drawn from an alphabet \(\Sigma = [\sigma]\). The top-\(k\) document retrieval problem is to preprocess \(\mathcal{D}\) into a data structure that, given a query \((P[1..p], k)\), can return the \(k\) documents of \(\mathcal{D}\) most relevant to the pattern \(P\). Relevance is captured by a predefined ranking function, which depends on the set of occurrences of \(P\) in a document \(\mathsf{T}_d\). For example, it can be the term frequency (i.e., the number of occurrences of \(P\) in \(\mathsf{T}_d\)), the term proximity (i.e., the distance between the closest pair of occurrences of \(P\) in \(\mathsf{T}_d\)), or a pattern-independent importance score of \(\mathsf{T}_d\) such as PageRank. Linear-space solutions with optimal query time already exist for this problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. Space-efficient data structures for term-proximity-based retrieval, however, have remained elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only \(o(n)\) bits on top of any compressed suffix array of \(\mathcal{D}\) and answers queries in time \(O((p+k)\,\mathrm{polylog}\, n)\).
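
To make the problem statement concrete, the following is a minimal brute-force sketch of top-\(k\) retrieval under two of the ranking functions named above. It is not the data structure presented in the paper: it scans every document at query time and uses no index at all. The function names (`occurrences`, `term_frequency`, `term_proximity`, `top_k`) are hypothetical, and two conventions are assumptions made for illustration only: a document with fewer than two occurrences receives the worst possible proximity score, and ties are broken by the smaller document index.

```python
import heapq
from typing import Callable, List


def occurrences(text: str, pattern: str) -> List[int]:
    """Return all start positions of pattern in text, overlaps included."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(occ: List[int]) -> float:
    """Number of occurrences; larger means more relevant."""
    return float(len(occ))


def term_proximity(occ: List[int]) -> float:
    """Negated distance between the closest pair of occurrences.

    The distance is negated so that, as with term frequency, a larger
    score means a more relevant document. Assumption: with fewer than
    two occurrences the score is -infinity.
    """
    if len(occ) < 2:
        return float("-inf")
    # occ is sorted, so the closest pair is always an adjacent pair.
    return -float(min(b - a for a, b in zip(occ, occ[1:])))


def top_k(docs: List[str], pattern: str, k: int,
          score: Callable[[List[int]], float]) -> List[int]:
    """Return indices of the k best-scoring documents that contain pattern."""
    scored = []
    for d, text in enumerate(docs):
        occ = occurrences(text, pattern)
        if occ:  # documents without any occurrence are never reported
            scored.append((score(occ), d))
    # Higher score first; ties go to the smaller document index.
    best = heapq.nlargest(k, scored, key=lambda t: (t[0], -t[1]))
    return [d for _, d in best]
```

For instance, with `docs = ["abab", "ab--ab--ab", "ab", "xyz"]` and pattern `"ab"`, term frequency ranks document 1 first (three occurrences), while term proximity ranks document 0 first (closest pair at distance 2). This baseline takes time proportional to the total collection size per query, which is exactly the cost an indexed solution is meant to avoid.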
