Abstract

This paper proposes and evaluates time- and space-efficient methods for discovering links between matching entities in large datasets, using state-of-the-art methods for measuring edit distance as a string similarity metric. It proposes three filtering methods that build on a basic blocking technique to organize the target dataset, facilitating efficient pruning of dissimilar pairs, and compares them in terms of runtime and memory usage. The first method exploits the blocking structure using the triangle inequality in conjunction with the substring-matching criterion. The second method uses only the substring-matching criterion, while the third uses the substring-matching criterion in conjunction with the frequency-matching criterion. Evaluation results show the pruning power of the different criteria, also in comparison with the string-matching functionality provided in LIMES and SILK, which are state-of-the-art tools for large-scale link discovery.
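The abstract names the pruning criteria without defining them, so the following Python sketch is only an illustration of how filters of this kind can avoid edit distance computations. It assumes textbook lower bounds on Levenshtein distance and may differ from the paper's actual definitions: the substring criterion is taken to be the pigeonhole argument (if two strings are within k edits, at least one of k+1 contiguous pieces of the first occurs verbatim in the second), the frequency criterion is taken to be a character-count bound, and the triangle inequality is applied against a hypothetical block representative. All function names and the threshold parameter `k` are illustrative.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def passes_frequency_filter(a: str, b: str, k: int) -> bool:
    """Assumed frequency criterion: one edit removes at most one surplus
    character on each side, so the larger surplus is a lower bound."""
    ca, cb = Counter(a), Counter(b)
    surplus_a = sum((ca - cb).values())  # characters a has in excess of b
    surplus_b = sum((cb - ca).values())  # characters b has in excess of a
    return max(surplus_a, surplus_b) <= k


def passes_substring_filter(a: str, b: str, k: int) -> bool:
    """Assumed substring criterion: split a into k+1 contiguous pieces;
    k edits leave at least one piece untouched, so it must occur in b."""
    parts = k + 1
    base, extra = divmod(len(a), parts)
    start = 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        if a[start:end] in b:
            return True
        start = end
    return False


def passes_triangle_filter(d_a_rep: int, d_b_rep: int, k: int) -> bool:
    """Triangle inequality: |d(a, rep) - d(b, rep)| <= d(a, b), where rep
    is a hypothetical representative string of a block."""
    return abs(d_a_rep - d_b_rep) <= k


def is_match(a: str, b: str, k: int) -> bool:
    """Run the cheap filters first; compute edit distance only for survivors."""
    if abs(len(a) - len(b)) > k:
        return False
    if not passes_frequency_filter(a, b, k):
        return False
    if not passes_substring_filter(a, b, k):
        return False
    return edit_distance(a, b) <= k
```

Each filter is a necessary condition only: a pair that fails any of them cannot be within distance k, while a pair that passes still needs the full edit distance check. In a blocking setup, the distances to the block representative would be computed once per string and reused by the triangle filter across many candidate pairs.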

