Linking Heterogeneous Data in the Semantic Web Using Scalable and Domain-Independent Candidate Selection

Dezhao Song, Jeff Heflin, Yi Luo

doi:10.1109/tkde.2016.2606399


https://doi.org/10.1109/tkde.2016.2606399



Abstract


Due to the decentralized nature of the Semantic Web, the same real-world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. To facilitate data utilization and consumption in the Semantic Web without compromising people’s freedom to publish their data, one critical problem is to interlink such heterogeneous data appropriately. This interlinking process is sometimes referred to as Entity Matching, i.e., finding which identifiers refer to the same real-world entity. In this paper, we propose two candidate selection algorithms to improve the scalability of entity matching systems. First, we propose HistSim, which uses the matching histories of instances to prune instance pairs that are not sufficiently similar to the same pool of other instances. A sigmoid-function-based thresholding method automatically adjusts the threshold for such commonality on the fly. Second, we propose DisNGram, which selects candidate instance pairs by computing a character-level similarity metric on discriminating literal values that are chosen using domain-independent unsupervised learning. Instances are indexed on the chosen predicates’ literal values to enable efficient look-up of similar instances. Finally, to handle heterogeneous datasets with a large number of predicates, we propose a mechanism for automatically determining predicate comparability. We evaluate our two candidate selection algorithms against six state-of-the-art systems on three Semantic Web datasets and demonstrate that they frequently outperform those systems on F1-score and runtime.
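
To make the idea of index-based candidate selection concrete, the following is a minimal sketch and not the DisNGram algorithm itself. It assumes one literal value per instance, character bigrams, Jaccard similarity as the character-level metric, and a fixed threshold; the names `char_ngrams`, `select_candidates`, and `min_jaccard`, as well as the instance identifiers, are hypothetical. The abstract does not specify any of these choices.

```python
from collections import defaultdict
from itertools import combinations


def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def select_candidates(literals, n=2, min_jaccard=0.5):
    """Return id pairs whose literal values have Jaccard >= min_jaccard.

    `literals` maps a (hypothetical) instance id to one literal value.
    Jaccard over n-gram sets is an assumed stand-in for the paper's
    character-level similarity metric.
    """
    grams = {iid: char_ngrams(value, n) for iid, value in literals.items()}

    # Inverted index: n-gram -> ids of the instances that contain it.
    index = defaultdict(set)
    for iid, gram_set in grams.items():
        for gram in gram_set:
            index[gram].add(iid)

    # Only pairs that share at least one n-gram are ever compared,
    # which avoids scoring every possible pair of instances.
    seen = set()
    candidates = set()
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in seen:
                continue
            seen.add((a, b))
            shared = len(grams[a] & grams[b])
            total = len(grams[a] | grams[b])
            if shared / total >= min_jaccard:
                candidates.add((a, b))
    return candidates
```

Calling `select_candidates({"a:1": "Jeff Heflin", "b:7": "jeff heflin", "c:3": "Yi Luo"})` compares only instances that share a bigram, so the third instance is never scored against the other two. A production system would also need the predicate selection and comparability steps that the abstract describes, which this sketch omits.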

Full Text