Abstract
Nowadays, data integration must often manage noisy data that also contain attribute values written in natural language, such as product descriptions or book reviews. In the data integration process, Entity Linkage has the role of identifying records that contain information referring to the same object. In order to reduce the dimension of the problem, modern Entity Linkage methods partition the initial search space into “blocks” of records that can be considered similar according to some metric, and then compare only the records belonging to the same block, thus greatly reducing the overall complexity of the algorithm. In this paper, we propose two automatic blocking strategies that, differently from traditional methods, aim at capturing the semantic properties of data by means of recent deep learning frameworks. Both methods, in a first phase, exploit recent research on tuple and sentence embeddings to transform the database records into real-valued vectors; in a second phase, to arrange the tuples inside the blocks, one of them adopts approximate nearest neighbour algorithms, while the other uses dimensionality reduction techniques combined with clustering algorithms. We train our blocking models on an external, independent corpus and then apply them directly to new datasets in an unsupervised fashion. This choice is motivated by the fact that, in most data integration scenarios, no training data are actually available. We tested our systems on six popular datasets and compared their performance against five traditional blocking algorithms. The results demonstrate that our deep-learning-based blocking solutions outperform standard blocking algorithms, especially on textual and noisy data.
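To make the two-phase strategy more concrete, the sketch below (not the paper's actual implementation) embeds record strings and then groups each record with its nearest neighbours to form candidate blocks. The `embed_records` helper, the TF-IDF stand-in for the pretrained tuple/sentence embedding models, and the use of scikit-learn's exact NearestNeighbors in place of a true approximate index (e.g. FAISS or Annoy) are assumptions made purely for illustration.

```python
# Minimal sketch of the two-phase blocking pipeline described in the abstract.
# Assumptions (not from the paper): records are plain strings, embed_records is
# a placeholder for a pretrained tuple/sentence embedding model, and the exact
# NearestNeighbors search stands in for an approximate nearest-neighbour index.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def embed_records(records):
    """Phase 1: map each record to a real-valued vector.
    TF-IDF is used here only as a stand-in for a deep embedding model."""
    return TfidfVectorizer().fit_transform(records).toarray()

def build_blocks(records, k=3):
    """Phase 2: place each record in a block with its k nearest neighbours."""
    vectors = embed_records(records)
    nn = NearestNeighbors(n_neighbors=min(k, len(records)), metric="cosine")
    nn.fit(vectors)
    _, neighbour_ids = nn.kneighbors(vectors)
    # Each row of neighbour_ids is a candidate block: only records that share
    # a block are compared in the subsequent entity-linkage step.
    return [{int(i) for i in row} for row in neighbour_ids]

records = [
    "Apple iPhone 13 128GB black smartphone",
    "iPhone 13, 128 GB, colour: black",
    "Samsung Galaxy S21 5G 256GB",
]
print(build_blocks(records, k=2))
```

In the actual systems, the embedding phase would rely on the pretrained deep models discussed in the paper, and the neighbour search would use an approximate index so that blocking scales to large datasets.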
Highlights
The integration of data coming from different sources is today of paramount importance: companies, hospitals, government agencies, banks and many other actors, in order to carry out their everyday activities, need to merge several datasets, e.g. customer databases or patient and pathology records. Integrating data in these scenarios may be relatively simple, especially when the data sources have clean and standard attributes, but with the increased use of internet-based services such as e-commerce, product comparison websites or online libraries, data integration is becoming more challenging.
We present the two dimensionality reduction techniques used in our methodology: principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE); an illustrative sketch of this reduce-then-cluster step follows these highlights
We first provide the results of the tests between our blocking systems and the traditional methods, and we investigate the impact of different architectural choices of our model
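As a rough, self-contained illustration of the reduce-then-cluster strategy referenced in the highlights above, the sketch below applies PCA followed by t-SNE to a set of tuple embeddings and then clusters the projected points so that each cluster plays the role of a block. All parameter values (PCA dimensionality, perplexity, number of clusters) and the use of k-means are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of the second strategy: dimensionality reduction (PCA, then
# t-SNE) followed by clustering to form blocks. Parameters are illustrative
# assumptions, not the values used in the paper.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def blocks_via_reduction_and_clustering(vectors, n_blocks=10, pca_dim=50):
    """Reduce the embedding space, then cluster; each cluster is a block."""
    vectors = np.asarray(vectors)
    # PCA first, to denoise and speed up t-SNE on high-dimensional embeddings.
    reduced = PCA(n_components=min(pca_dim, vectors.shape[1])).fit_transform(vectors)
    # t-SNE projects the records into a low-dimensional space where
    # semantically similar records end up close to each other.
    projected = TSNE(n_components=2, perplexity=min(30, len(vectors) - 1),
                     init="pca", random_state=0).fit_transform(reduced)
    # Records sharing a cluster label belong to the same block.
    return KMeans(n_clusters=n_blocks, n_init=10, random_state=0).fit_predict(projected)

# Example: 200 random 300-dimensional "tuple embeddings".
rng = np.random.default_rng(0)
print(blocks_via_reduction_and_clustering(rng.normal(size=(200, 300)), n_blocks=5))
```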
Summary
The integration of data coming from different sources is today of paramount importance: companies, hospitals, government agencies, banks and many other actors, in order to carry out their everyday activities, need to merge several datasets, e.g. customer databases or patient and pathology records. Integrating data in these scenarios may be relatively simple, especially when the data sources have clean and standard attributes, but with the increased use of internet-based services such as e-commerce, product comparison websites or online libraries, data integration is becoming more challenging. The current disruptive growth in dataset sizes makes the problem intractable, since, when the number and the sizes of the datasets grow, the number of record pairs to compare quickly becomes prohibitive.