Abstract
Record linkage is the process of identifying pairs of records that refer to the same real-world object within or across data files. In the basic approach, each record pair is compared using a similarity function and then classified as a presumed match or non-match. However, if every possible record pair has to be compared, the number of comparisons grows quadratically with the number of records, which leads to infeasible running times for large data files. In such situations, blocking or indexing methods are required to reduce the comparison space. In this paper we propose a new blocking technique, Q-gram Fingerprinting, that efficiently filters record pairs according to an approximation of a q-gram similarity function. The new method first transforms data records into bit vectors (the fingerprints) and then uses a Multibit Tree to filter pairs of fingerprints according to a user-defined similarity threshold. We examined the effect of different parameter choices for Q-gram Fingerprinting, tested its scalability, and compared it with several alternative methods using simulated person data. The comparison study showed promising results for the proposed method.
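The two steps named above (mapping records to bit-vector fingerprints, then filtering pairs by a similarity threshold) can be illustrated with a minimal sketch. It is not the paper's implementation: the padding scheme, the SHA-1 based bit position, the 64-bit fingerprint length, the use of the Tanimoto (Jaccard) coefficient, and all function names are assumptions made for illustration. A plain loop over all pairs stands in for the Multibit Tree, using only a bit-count bound to skip pairs that cannot reach the threshold.

```python
import hashlib
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of q-grams of a padded, lower-cased string (assumed scheme)."""
    padded = "_" * (q - 1) + text.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def fingerprint(text, q=2, n_bits=64):
    """Map a record to a bit vector, stored as an int, with one bit set per q-gram."""
    bits = 0
    for gram in qgrams(text, q):
        digest = hashlib.sha1(gram.encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % n_bits
        bits |= 1 << position
    return bits


def tanimoto(a, b):
    """Tanimoto coefficient of two bit vectors: |a AND b| / |a OR b|."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


def candidate_pairs(records, threshold=0.7, q=2, n_bits=64):
    """Return index pairs whose fingerprints reach the similarity threshold."""
    prints = [fingerprint(r, q, n_bits) for r in records]
    counts = [bin(p).count("1") for p in prints]
    pairs = []
    # Brute-force stand-in for the Multibit Tree search (assumption).
    for i, j in combinations(range(len(records)), 2):
        lo, hi = sorted((counts[i], counts[j]))
        # The Tanimoto coefficient can never exceed lo / hi, so skip early.
        if hi and lo / hi < threshold:
            continue
        if tanimoto(prints[i], prints[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```

Calling `candidate_pairs(["Smith", "smith", "Meyer"], threshold=0.7)` returns the index pairs that survive the filter; only these would then be passed to the full comparison and classification step. Because several q-grams can hash to the same bit, the fingerprint similarity only approximates the similarity of the underlying q-gram sets.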