A Blocking Scheme for Entity Resolution in the Semantic Web

Gustavo De Assis Costa, Jose Maria Parente De Oliveira

doi:10.1109/aina.2016.23

Abstract

The amount and diversity of data in the Semantic Web have grown considerably. RDF datasets have proportionally more problems than relational datasets because of the way data are published, usually without formal criteria. Entity Resolution (ER) is an important issue, related to a well-known task in many research communities, that aims at finding all representations that refer to the same entity in different datasets. Yet it is still an open problem. Blocking methods avoid the quadratic complexity of the brute-force approach by clustering entities into blocks and limiting the evaluation of entity specifications to entity pairs within blocks. In recent years only a few blocking methods have been conceived to deal with RDF data, and novel blocking techniques are required for dealing with noisy and heterogeneous data in the Web of Data. In this paper we present a blocking scheme, CER-Blocking, which is based on an inverted index structure and uses different pieces of evidence from a triple, aiming to maximize its effectiveness. To cope with poor-quality or even missing data, we use two blocking key definitions. This scheme is part of an ER approach based on a relational learning algorithm that addresses the problem by statistical approximation. It was empirically evaluated on real and synthetic datasets that are part of consolidated benchmarks from the literature.
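To make the idea of inverted-index blocking concrete, the following is a minimal sketch of generic token blocking over RDF-like triples. It is not CER-Blocking itself: the abstract does not specify the two blocking key definitions, so the single key used here (lowercased tokens of the object literal) is an assumption, as are the function names and the toy triples.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokens(text):
    """Assumed blocking key: the set of lowercase alphanumeric tokens of a literal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_blocks(triples):
    """Build an inverted index mapping each token to the subjects whose literals contain it."""
    index = defaultdict(set)
    for subject, _predicate, obj in triples:
        for token in tokens(obj):
            index[token].add(subject)
    # A block with a single entity yields no comparisons, so drop it.
    return {key: members for key, members in index.items() if len(members) > 1}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that co-occur in at least one block."""
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


# Hypothetical toy data: two datasets (prefixes a: and b:) describing the same people.
triples = [
    ("a:1", "rdfs:label", "Alan Turing"),
    ("b:7", "foaf:name", "Turing, Alan M."),
    ("a:2", "rdfs:label", "Grace Hopper"),
    ("b:9", "foaf:name", "Hopper, Grace"),
]

blocks = build_blocks(triples)
pairs = candidate_pairs(blocks)
print(sorted(pairs))
```

With four entities, the brute-force approach would evaluate all six pairs, whereas this sketch keeps only the two pairs that share at least one token. A real scheme would also need to handle the redundancy that arises when the same pair appears in many blocks, and noisy or missing literals, which is the case the paper addresses with its two key definitions.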

Full Text