Let $\mathcal D$ be a collection of $D$ documents, which are strings over an alphabet of size $\sigma$, of total length $n$. We describe a data structure that uses linear space and reports $k$ most relevant documents that contain a query pattern $P$, which is a string of length $p$ packed in $p/\log_\sigma n$ words, in time $O(p/\log_\sigma n+k)$. This is optimal in the RAM model in the general case where $\log D = \Theta(\log n)$, and involves a novel RAM-optimal suffix tree search. Our construction supports an ample set of important relevance measures, such as the number of times $P$ appears in a document (called term frequency), a fixed document importance, and the minimal distance between two occurrences of $P$ in a document. When $\log D = o(\log n)$, we show how to reduce the space of the data structure from $O(n\log n)$ to $O(n(\log\sigma+\log D+\log\log n))$ bits, and to $O(n(\log\sigma+\log D))$ bits in the case of the popular term frequency measure of relevance, at the price of an additive term $O(\log^\varepsilon_\sigma n)$ in the query time, for any constant $\varepsilon>0$. We also consider the dynamic scenario, where documents can be inserted and deleted from the collection. We obtain linear space and query time $O(p(\log\log n)^2/\log_\sigma n+\log n + k\log\log k)$, whereas insertions and deletions require $O(\log^{1+\varepsilon} n)$ time per symbol, for any constant $\varepsilon>0$. Finally, we consider an extended static scenario where an extra parameter $\mathtt{par}(P,d)$ is defined, and the query must retrieve only documents $d$ such that $\mathtt{par}(P,d)\in [\tau_1,\tau_2]$, where this range is specified at query time. We solve these queries using linear space and $O(p/\log_\sigma n + \log^{1+\varepsilon} n + k\log^\varepsilon n)$ time, for any constant $\varepsilon>0$. Our technique is to translate these top-$k$ problems into multidimensional geometric search problems. As a bonus, we describe some improvements to those problems.
Read full abstract