Top-$$k$$ Term-Proximity in Succinct Space

J Ian Munro,Sharma V Thankachan,Rahul Shah,Gonzalo Navarro,Jesper Sindahl Nielsen

doi:10.1007/978-3-319-13075-0_14

Abstract

Let $\mathcal {D} = \{\mathsf {T}_1,\mathsf {T}_2, \dots ,\mathsf {T}_D\}$ be a collection of $D$ string documents of $n$ characters in total, that are drawn from an alphabet set $\Sigma =[\sigma ]$. The top-$k$ document retrieval problem is to preprocess $\mathcal{D}$ into a data structure that, given a query $(P[1..p],k)$, can return the $k$ documents of $\mathcal{D}$ most relevant to pattern $P$. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of $P$ in $\mathsf {T}_d$. For example, it can be the term frequency (i.e., the number of occurrences of $P$ in $\mathsf {T}_d$), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of $P$ in $\mathsf {T}_d$), or a pattern-independent importance score of $\mathsf {T}_d$ such as PageRank. Linear space and optimal query time solutions already exist for this problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space efficient data structures for term proximity based retrieval have been evasive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only $o(n)$ bits on top of any compressed suffix array of $\mathcal{D}$ and solves queries in time $O((p+k) {{\mathrm{polylog}}}n)$.

Full Text