Time and Space Efficient Large Scale Link Discovery using String Similarities

Andreas Karampelas, George A Vouros

doi:10.3233/fi-2020-1906

Abstract

This paper proposes and evaluates time- and space-efficient methods for discovering links between matching entities in large datasets, using state-of-the-art methods for measuring edit distance as a string similarity metric. It proposes three filtering methods that build on a basic blocking technique to organize the target dataset, facilitating efficient pruning of dissimilar pairs, and compares them in terms of runtime and memory usage. The first method exploits the blocking structure using the triangle inequality in conjunction with the substring-matching criterion. The second method uses only the substring-matching criterion, while the third method uses the substring-matching criterion in conjunction with the frequency-matching criterion. Evaluation results show the pruning power of the different criteria, also in comparison with the string-matching functionality provided in LIMES and SILK, which are state-of-the-art tools for large-scale link discovery.
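
To make the idea of pruning before computing edit distance concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single hypothetical exemplar string in place of the paper's blocking structure, uses the character-count lower bound as one common reading of a frequency-based criterion, and omits the substring-matching criterion, whose details the abstract does not give. All function names and the threshold parameter are illustrative assumptions.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def frequency_lower_bound(a: str, b: str) -> int:
    """Lower bound on edit distance from character counts alone.

    Each edit operation changes either surplus by at most one, so the
    larger surplus can never exceed the true edit distance.
    """
    count_a, count_b = Counter(a), Counter(b)
    surplus_a = sum((count_a - count_b).values())
    surplus_b = sum((count_b - count_a).values())
    return max(surplus_a, surplus_b)


def filtered_links(sources, targets, exemplar, threshold):
    """Return (source, target, distance) triples with distance <= threshold.

    Assumption: one exemplar stands in for a block representative.
    Distances from targets to the exemplar are computed once and reused.
    """
    target_to_exemplar = [(t, edit_distance(t, exemplar)) for t in targets]
    links = []
    for s in sources:
        s_to_exemplar = edit_distance(s, exemplar)
        for t, t_dist in target_to_exemplar:
            # Triangle inequality: |d(s, e) - d(e, t)| <= d(s, t).
            if abs(s_to_exemplar - t_dist) > threshold:
                continue
            # Cheap character-count bound before the full computation.
            if frequency_lower_bound(s, t) > threshold:
                continue
            d = edit_distance(s, t)
            if d <= threshold:
                links.append((s, t, d))
    return links
```

Both filters are lower bounds on the edit distance, so under these assumptions they can only discard pairs that would fail the threshold anyway; the surviving pairs are the same ones a brute-force comparison would report.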

Full Text