Abstract

Document listing is a fundamental problem in information retrieval: the objective is to retrieve all documents in a collection that are relevant to an input pattern. Several variations of this problem have been studied, such as ranked document retrieval, document listing with two patterns, and document listing with forbidden patterns. We introduce the problem of document retrieval with forbidden extension. Let D = {T_1, T_2, …, T_D} be a collection of D string documents of n characters in total, and let P+ and P− be two query patterns, where P+ is a proper prefix of P−. We call P− the forbidden extension of the included pattern P+. A forbidden extension query 〈P+, P−〉 asks to report all occ documents in D that contain P+ as a substring but do not contain P− as one. A top-k forbidden extension query 〈P+, P−, k〉 asks to report the k documents among those occ documents that are most relevant to P+, where each document is given a unique fixed score (PageRank) and the relevance of a document is determined by its score. We present a linear-space index (in words) with O(|P−| + occ) query time for the document listing problem. For the top-k version of the problem, we achieve the following space-time trade-offs:
• O(n) space (in words) and O(|P−| log σ + k) query time.
• |CSA| + |CSA*| + D log(n/D) + O(n) bits and O(search(P−) + k · t_SA · log^{2+ε} n) query time, where ε > 0 is an arbitrarily small constant.
• |CSA| + O(n log D) bits and O(search(P−) + (k + log D) log D) query time.
Here σ is the size of the alphabet, CSA (of size |CSA| bits) is the compressed suffix array of the concatenated text of all documents, CSA_d is the CSA of T_d, and |CSA*| = Σ_{d=1}^{D} |CSA_d|. Also, search(P−) is the time for pattern matching, and t_SA is the time to compute a suffix array (or inverse suffix array) value.
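To make the query semantics concrete, the following is a minimal brute-force sketch of a forbidden extension query and its top-k variant. It is not the index described in the abstract: it scans every document on each query and so does not attain any of the stated bounds. The function names, the example documents, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def forbidden_extension_query(docs: Sequence[str], p_plus: str, p_minus: str) -> List[int]:
    """Return indices of documents that contain p_plus but not p_minus.

    Naive baseline: a linear scan over all documents, for illustration only.
    """
    # The problem definition requires P+ to be a proper prefix of P-.
    if not (p_minus.startswith(p_plus) and len(p_plus) < len(p_minus)):
        raise ValueError("p_plus must be a proper prefix of p_minus")
    return [i for i, text in enumerate(docs)
            if p_plus in text and p_minus not in text]


def top_k_forbidden_extension(docs: Sequence[str], scores: Sequence[float],
                              p_plus: str, p_minus: str, k: int) -> List[int]:
    """Return the k qualifying documents with the highest fixed scores."""
    hits = forbidden_extension_query(docs, p_plus, p_minus)
    # Scores are fixed per document and independent of the pattern.
    return sorted(hits, key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical example data (assumed for illustration).
docs = ["banana", "bandana", "cabana", "band"]
scores = [0.4, 0.9, 0.7, 0.1]

listing = forbidden_extension_query(docs, "ban", "band")
best = top_k_forbidden_extension(docs, scores, "ban", "band", k=1)
```

Here "banana" and "cabana" contain "ban" without containing "band", so they qualify, while "bandana" and "band" are excluded because they contain the forbidden extension. A baseline like this can serve as a correctness oracle when testing an actual index.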
