Abstract

Record linkage is the task of processing a dataset to identify which records refer to the same real-world entity. The intrinsic complexity of this task poses many challenges to traditional or naive approaches, especially in contexts involving Big Data, unstructured data, and frequent data increments. To deal with these contexts, especially the latter, an incremental record linkage approach may be employed to avoid (re)processing the entire dataset whenever the deduplication results must be updated. To do so, different classification techniques can be employed to identify duplicate entities. Recently, many algorithms have been proposed that combine collective classification, which employs clustering algorithms, with the incremental principle. In this article, we propose new metrics for incremental record linkage using collective classification, as well as new heuristics (combining clustering, coverage component filters, and a greedy approach) to further speed up incremental record linkage. These heuristics were evaluated on three datasets of different scales, and the results were analyzed and discussed using both classical metrics and the newly proposed ones. The experiments reveal the different trade-offs between efficacy and efficiency produced by the considered heuristics. The results also indicate that, for large and frequent data increments, a coverage filter-based heuristic runs reasonably faster than the current state-of-the-art approach at the cost of a slight reduction in efficacy. Alternatively, single-pass clustering algorithms can execute significantly faster than the state-of-the-art approach, at the cost of lower precision.
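To make the incremental idea concrete, the sketch below illustrates a generic single-pass incremental clustering step for record linkage: each arriving record is compared only against existing cluster representatives, so a data increment never forces reprocessing of the whole dataset. This is not the paper's algorithm; the similarity measure (token Jaccard), the threshold, and the cluster representation are illustrative assumptions.

```python
# Illustrative single-pass incremental clustering for record linkage.
# The token-Jaccard similarity and the 0.5 threshold are assumptions,
# not the configuration evaluated in the article.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

class IncrementalLinker:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        # Each cluster keeps a representative token set and its member ids.
        self.clusters: list[dict] = []

    def add(self, record_id: str, text: str) -> int:
        """Assign one new record to its best-matching cluster, or start a
        new one. Only cluster representatives are compared, so the cost of
        an increment does not grow with the full history of records."""
        tokens = set(text.lower().split())
        best, best_sim = None, 0.0
        for idx, cluster in enumerate(self.clusters):
            sim = jaccard(tokens, cluster["tokens"])
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is not None and best_sim >= self.threshold:
            self.clusters[best]["members"].append(record_id)
            self.clusters[best]["tokens"] |= tokens  # refresh representative
            return best
        self.clusters.append({"tokens": tokens, "members": [record_id]})
        return len(self.clusters) - 1

# Usage: records arriving in increments are linked one at a time.
linker = IncrementalLinker()
linker.add("r1", "ACME Corp New York")
linker.add("r2", "ACME Corporation New York")  # joins r1's cluster (sim 0.6)
linker.add("r3", "Globex Ltd London")          # starts a new cluster
```

As the abstract notes, this single-pass style trades precision for speed: a record is placed once and never revisited, whereas collective approaches may re-cluster affected components when new evidence arrives.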
