Abstract

We study the Document Listing problem, where a collection D of documents \(d_1, \ldots, d_k\) of total length \(\sum_i d_i = n\) is to be preprocessed so that one can later efficiently list all the \(\textrm{ndoc}\) documents containing a given query pattern P of length m as a substring. Muthukrishnan (SODA 2002) gave an optimal solution to the problem: with O(n) time preprocessing, one can answer the queries in \(O(m+\textrm{ndoc})\) time. In this paper, we improve the space requirement of Muthukrishnan's solution from O(n log n) bits to |CSA| + 2n + n log k (1 + o(1)) bits, where |CSA| ≤ n log |Σ| (1 + o(1)) is the size of any suitable compressed suffix array (CSA), and Σ is the underlying alphabet of the documents. The time requirement depends on the CSA used, but we can obtain e.g. the optimal \(O(m+\textrm{ndoc})\) time when [condition on |Σ| and k; formula missing in the source]. For general |Σ|, k the time requirement becomes \(O(m \log |\Sigma|+\textrm{ndoc} \log k)\). Sadakane (ISAAC 2002) has developed a similar space-efficient variant of Muthukrishnan's solution; we obtain a better time requirement in most cases, but a slightly worse space requirement.
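
To make the problem statement concrete, the following is a minimal, uncompressed sketch of document listing in the style usually attributed to Muthukrishnan's approach: a document array over the sorted suffixes, a "previous occurrence of the same document" array, and recursive range-minimum queries. It is not the space-efficient structure of this paper. The suffixes are sorted naively, the range minimum is found by a linear scan instead of a constant-time structure, and all names (`build_index`, `suffix_range`, `list_documents`) are hypothetical, so neither the preprocessing nor the query bounds stated above hold for this code.

```python
from bisect import bisect_left


def build_index(docs):
    """Naive index: sorted suffixes, document array, previous-occurrence array."""
    # Every suffix of every document, tagged with its document id.
    suffixes = sorted(
        (doc[i:], d) for d, doc in enumerate(docs) for i in range(len(doc))
    )
    texts = [s for s, _ in suffixes]
    doc_array = [d for _, d in suffixes]
    # prev[i] = largest j < i with doc_array[j] == doc_array[i], or -1 if none.
    prev, last_seen = [], {}
    for i, d in enumerate(doc_array):
        prev.append(last_seen.get(d, -1))
        last_seen[d] = i
    return texts, doc_array, prev


def suffix_range(texts, pattern):
    """Half-open range [sp, ep) of sorted suffixes that start with pattern."""
    sp = bisect_left(texts, pattern)
    ep = sp
    # Suffixes prefixed by the pattern are contiguous in sorted order.
    while ep < len(texts) and texts[ep].startswith(pattern):
        ep += 1
    return sp, ep


def list_documents(index, pattern):
    """Return the sorted ids of the documents containing pattern."""
    texts, doc_array, prev = index
    sp, ep = suffix_range(texts, pattern)
    found = []
    stack = [(sp, ep)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        # Linear-scan stand-in for a constant-time range-minimum query.
        i = min(range(lo, hi), key=prev.__getitem__)
        if prev[i] >= sp:
            # Every document in [lo, hi) also occurs earlier inside [sp, ep).
            continue
        found.append(doc_array[i])  # first occurrence of this document in range
        stack.append((lo, i))
        stack.append((i + 1, hi))
    return sorted(found)
```

For example, with `index = build_index(["banana", "bandana", "cabana", "apple"])`, the call `list_documents(index, "ana")` returns `[0, 1, 2]`. Each document is reported exactly once, at the position in the suffix range whose previous occurrence falls before `sp`.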
