Abstract
This paper studies the near-duplicate text alignment problem under the constraint of Jaccard similarity. Specifically, given a collection of long texts and a short query text, this problem finds all the subsequences in each text whose Jaccard similarities to the query are no smaller than a given threshold. Near-duplicate text alignment is computationally intensive. This is because there are O(n 2 ) subsequences in a text with n tokens. To remedy this issue, a few recent studies propose to first generate the min-hash sketch of every subsequence in each text and then find all the subsequences whose min-hash sketches are similar to that of the query. They introduce the concept of "compact windows" and show that the O(n 2 k) min-hashes in a text with n tokens can be losslessly compressed in compact windows using O(nk) space, where k is the sketch size. However, the space cost O(nk) is still too high for long texts, especially when the sketch size k is large. To address this issue, we propose to use One Permutation Hashing (OPH) to generate the min-hash sketch and introduce the concept of "OPH compact windows". Although the size of each sketch remains the same, which is O(k), we prove that all the O(n 2 k) min-hashes generated by OPH in a text with n tokens can be losslessly compressed in OPH compact windows using only O(n+k) space. Note the generation of OPH compact windows does not necessitate the enumeration of the O(n 2 k) min-hashes. Moreover, we develop an algorithm to find all the sketches in a text similar to that of the query directly from OPH compact windows, along with three optimizations.We conduct extensive experiments on three real-world datasets. Empirical results show our proposed algorithms significantly outperformed existing methods in terms of index cost and query latency and scaled well.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.