Abstract

Due to the decentralized nature of the Semantic Web, the same real-world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. To facilitate data utilization and consumption in the Semantic Web without compromising people's freedom to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Matching, i.e., finding which identifiers refer to the same real-world entity. In this paper, we propose two candidate selection algorithms to improve the scalability of entity matching systems. First, we propose HistSim, which uses the matching histories of instances to prune instance pairs that are not sufficiently similar to the same pool of other instances. A sigmoid-function-based thresholding method is proposed to automatically adjust the threshold for such commonality on the fly. Second, we propose DisNGram, which selects candidate instance pairs by computing a character-level similarity metric on discriminating literal values that are chosen using domain-independent unsupervised learning. Instances are indexed on the literal values of the chosen predicates to enable efficient look-up of similar instances. Finally, to handle heterogeneous datasets with a large number of predicates, we propose a mechanism for automatically determining predicate comparability. We evaluate our two candidate selection algorithms against six state-of-the-art systems on three Semantic Web datasets, and demonstrate that our proposed algorithms frequently outperform state-of-the-art systems on F1-score and runtime.
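To make the idea of index-based candidate selection on character-level similarity concrete, the following is a minimal sketch, not the DisNGram algorithm itself. It assumes each instance has already been reduced to a single literal value, and it uses Jaccard similarity over character trigrams as a stand-in metric. The function names, the trigram size, and the 0.5 threshold are all hypothetical choices made for illustration; the unsupervised selection of discriminating predicates is omitted.

```python
from collections import defaultdict


def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a literal value."""
    s = text.lower().strip()
    if len(s) < n:
        return {s} if s else set()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def build_index(instances, n=3):
    """Map each n-gram to the ids of instances whose literal contains it."""
    index = defaultdict(set)
    for inst_id, literal in instances.items():
        for gram in char_ngrams(literal, n):
            index[gram].add(inst_id)
    return index


def select_candidates(query, index, instances, n=3, threshold=0.5):
    """Return ids whose literal has Jaccard n-gram similarity >= threshold.

    The threshold and the Jaccard metric are illustrative assumptions.
    """
    q = char_ngrams(query, n)
    # Look up the index so that only instances sharing at least one
    # n-gram with the query are ever compared.
    seen = set()
    for gram in q:
        seen |= index.get(gram, set())
    result = []
    for inst_id in sorted(seen):
        g = char_ngrams(instances[inst_id], n)
        sim = len(q & g) / len(q | g)
        if sim >= threshold:
            result.append(inst_id)
    return result


if __name__ == "__main__":
    # Hypothetical toy data: instance id -> one literal value.
    instances = {
        "a": "Barack Obama",
        "b": "Barak Obama",
        "c": "Angela Merkel",
    }
    index = build_index(instances)
    print(select_candidates("Barack Obama", index, instances))
```

The index restricts pairwise comparison to instances that share at least one n-gram with the query, which is what keeps candidate selection from degenerating into an all-pairs comparison.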
