Abstract

Entity Resolution (ER) is the problem of identifying co-referent entity pairs across datasets, including knowledge graphs (KGs). ER is an important prerequisite in many applied KG search and analytics pipelines, with a typical workflow comprising two steps. In the first ‘blocking’ step, entities are mapped to blocks. Blocking avoids comparing all possible pairs of entities: in the second ‘similarity’ step, only entities within the same block are paired and compared, which yields significant computational savings with minimal loss of performance. Unfortunately, learning a blocking scheme in an unsupervised fashion is a non-trivial problem, and it has not been properly explored for the heterogeneous, semi-structured datasets that are prevalent in industrial and Web applications. This article presents an unsupervised algorithmic pipeline for learning Disjunctive Normal Form (DNF) blocking schemes on KGs, as well as on structurally heterogeneous tables that may not share a common schema. We evaluate the approach on six real-world dataset pairs and show that it is competitive with supervised and semi-supervised baselines.

Highlights

  • Entity Resolution (ER) is the identification of co-referent entities across datasets. Different communities refer to it as instance matching, record linkage, and the merge-purge problem [1,2].

  • Overall, when considering statistically significant results, the supervised method typically achieves a better Reduction Ratio (RR), but Pairs Completeness (PC) is high for all methods, with the proposed method performing best on dataset pair (DP) 4 and the supervised baseline performing best on DP 2, both with high significance. We believe that the former result was obtained because the proposed method has the strongest approximation bounds of all three systems, and that this effect would be most apparent on large DPs.

  • We presented a generic pipeline for learning Disjunctive Normal Form (DNF) blocking schemes on heterogeneous dataset pairs.
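
The two blocking metrics named in the highlights, Pairs Completeness (PC) and Reduction Ratio (RR), can be illustrated with a minimal sketch. It assumes the definitions commonly used in the blocking literature (PC as the fraction of true matches that survive blocking, RR as the fraction of exhaustive comparisons that blocking avoids); the function names and the toy data are hypothetical and are not taken from the paper.

```python
def pairs_completeness(candidates, ground_truth):
    """Fraction of true co-referent pairs that appear among the candidate pairs."""
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one pair")
    return len(candidates & ground_truth) / len(ground_truth)


def reduction_ratio(candidates, size_a, size_b):
    """Fraction of the size_a * size_b exhaustive comparisons avoided by blocking."""
    total = size_a * size_b
    if total == 0:
        raise ValueError("both datasets must be non-empty")
    return 1.0 - len(candidates) / total


# Hypothetical toy data: two datasets with four entities each.
candidates = {("a1", "b1"), ("a2", "b2"), ("a3", "b2")}
ground_truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

pc = pairs_completeness(candidates, ground_truth)   # 2 of 3 true matches kept
rr = reduction_ratio(candidates, size_a=4, size_b=4)  # 3 of 16 pairs compared
print(f"PC = {pc:.3f}, RR = {rr:.3f}")
```

The two metrics pull in opposite directions: a scheme that places every entity in one block has perfect PC and zero RR, which is why both are reported together.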


Summary

Introduction

Entity Resolution (ER) is the identification of co-referent entities across datasets. Different communities refer to it as instance matching, record linkage, and the merge-purge problem [1,2]. A blocking key, such as ‘Tokens(LastName)’, could first be applied to each node in the two KGs, as shown in the figure. In essence, this is a function that tokenizes the last name of each customer and assigns the customer to a block indexed by the last-name token. If these graphs each contained thousands or even millions of entities (which is not uncommon), the total number of pairwise comparisons could run into the trillions (10^6 × 10^6 = 10^12). An entity in one knowledge graph is linked to only a small number (typically far fewer than five) of entities in the other knowledge graph.
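
The effect of a blocking key such as ‘Tokens(LastName)’ can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's pipeline: the whitespace tokenizer, the helper names (`tokens`, `build_blocks`, `candidate_pairs`), and the toy customer records are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import product


def tokens(value):
    """Split a string into lowercase, whitespace-delimited tokens (assumed tokenizer)."""
    return set(value.lower().split())


def build_blocks(entities, field):
    """Map each token of `field` to the set of entity ids whose value contains it."""
    blocks = defaultdict(set)
    for entity_id, record in entities.items():
        for token in tokens(record.get(field, "")):
            blocks[token].add(entity_id)
    return blocks


def candidate_pairs(blocks_a, blocks_b):
    """Pair entities across the two datasets only when they share a block key."""
    pairs = set()
    for key in blocks_a.keys() & blocks_b.keys():
        pairs.update(product(blocks_a[key], blocks_b[key]))
    return pairs


# Hypothetical toy records standing in for customer nodes in two KGs.
kg_a = {
    "a1": {"LastName": "Smith"},
    "a2": {"LastName": "Garcia Lopez"},
    "a3": {"LastName": "Chen"},
}
kg_b = {
    "b1": {"LastName": "smith"},
    "b2": {"LastName": "Lopez"},
    "b3": {"LastName": "Nguyen"},
}

pairs = candidate_pairs(build_blocks(kg_a, "LastName"), build_blocks(kg_b, "LastName"))
exhaustive = len(kg_a) * len(kg_b)
print(f"{len(pairs)} candidate pairs instead of {exhaustive}")
```

On this toy input only the pairs sharing a last-name token are passed to the similarity step. The same idea is what keeps the comparison count tractable when each graph holds millions of entities.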
