Abstract

In this paper, we study edit similarity query processing to find strings similar to a query string from a collection of strings. To solve the problem, many algorithms have been proposed under a filter-and-verification framework, where candidate strings are generated and refined using a few filters and then verified to find true matches. A major focus of those algorithms has been on generating candidates as small as possible in an early stage of the query processing. A typical approach to generate candidates is to extract some signatures from a query and take union of string ids in the inverted lists of the extracted signatures. However, the number of candidates generated from existing techniques is extremely larger than the number of answer strings and costs for refinement and verification are expensive. To address the problem, we propose an intersection-based candidate generation scheme, which generates a substantially smaller number of candidates. Given some signatures of a query, the proposed scheme first categorizes signatures into several groups. Then, it takes intersection of string ids in the inverted lists of the signatures in each group. Finally, it takes union of the intersections to generate candidates. To minimize the number of candidates under our scheme, we propose a novel algorithm which judiciously selects an optimal signature group. We show through experiments that our technique is very effective in reducing the number of candidates and significantly improves the performance.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.