Schema-agnostic blocking for streaming data

Tiago Brasileiro Araújo, Kostas Stefanidis, Thiago Pereira Da Nóbrega, Jyrki Nummenmaa, Carlos Eduardo Santos Pires

doi:10.1145/3341105.3375776

Abstract

Currently, a large number of information systems continuously produce large amounts of data. Since these sources may contain overlapping knowledge, Entity Resolution (ER) emerges as a fundamental step for integrating multiple knowledge bases or identifying similarities between entities. Given the quadratic cost of the ER task, blocking techniques are often used to improve efficiency. Such techniques face two main challenges related to data volume (i.e., large data sources) and variety (i.e., heterogeneous data). They also face two further challenges: streaming data and incremental processing. To address these four challenges simultaneously, we propose PI-Block, a novel incremental schema-agnostic blocking technique that uses parallelism (through a distributed computational infrastructure) to enhance blocking efficiency. In our experimental evaluation, we use four pairs of real-world data sources and show that PI-Block achieves better efficiency and effectiveness than the state-of-the-art technique.
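
To make the idea of incremental, schema-agnostic blocking concrete, the following is a minimal single-machine sketch of token blocking, a common schema-agnostic approach. It is not PI-Block itself and involves no parallelism or distributed infrastructure; the class name `IncrementalTokenBlocker`, its methods, and the sample records are all hypothetical and chosen only for illustration. Each arriving entity is tokenized without regard to attribute names, and any previously seen entity sharing a token becomes a comparison candidate.

```python
import re
from collections import defaultdict


class IncrementalTokenBlocker:
    """Toy incremental, schema-agnostic token blocking (illustrative only)."""

    def __init__(self):
        # token -> ids of all entities seen so far whose values contain it
        self.blocks = defaultdict(set)

    @staticmethod
    def tokens(entity):
        # Schema-agnostic: attribute names are ignored, every value is tokenized.
        text = " ".join(str(value) for value in entity.values())
        return set(re.findall(r"\w+", text.lower()))

    def add(self, entity_id, entity):
        """Index one arriving entity and return the ids to compare it with."""
        candidates = set()
        for token in self.tokens(entity):
            # Entities already in this token's block share the token.
            candidates |= self.blocks[token]
            # Incremental update: only the affected blocks are touched.
            self.blocks[token].add(entity_id)
        candidates.discard(entity_id)
        return candidates


# Hypothetical records from two sources with different schemas.
blocker = IncrementalTokenBlocker()
first = blocker.add("a1", {"name": "Jon Smith", "city": "Tampere"})
second = blocker.add("b1", {"full_name": "Smith, Jon", "location": "Helsinki"})
third = blocker.add("b2", {"title": "Data Engineer"})
```

In this sketch, `second` contains `"a1"` because both records share the tokens `jon` and `smith` even though their attribute names differ, while `first` and `third` are empty. Only the candidate pairs returned by `add` would be passed to the expensive pairwise comparison, which is how blocking avoids the quadratic cost mentioned above.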
