Towards the efficient parallelization of multi-pass adaptive blocking for entity matching

Demetrio Gomes Mestre, Carlos Eduardo Santos Pires, Dimas Cassimiro Nascimento

doi:10.1016/j.jpdc.2016.11.002

Abstract

Modern parallel programming models such as MapReduce (MR) have proven to be powerful tools for the efficient parallel execution of data-intensive tasks such as Entity Matching (EM) in the era of Big Data. For this reason, studies of the challenges and possible solutions for applying this well-known cloud computing programming model to EM are in growing demand. Furthermore, the effectiveness and scalability of MR-based implementations of EM depend on how well the workload is balanced among all reduce tasks. In this article, we investigate how MapReduce can be used to perform efficient (load-balanced) parallel EM using a variation of the multi-pass Sorted Neighborhood Method (SNM) that uses a variable-size (adaptive) window. We propose the Multi-pass MapReduce Duplicate Count Strategy (MultiMR-DCS++), an MR-based approach for multi-pass adaptive SNM that aims to further improve the performance of SNM. Evaluation results based on real-world datasets and a cluster infrastructure show that our approach improves MapReduce-based SNM in both EM execution time and detection quality.
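
To make the idea of a multi-pass SNM with an adaptive window concrete, the following is a minimal single-machine sketch. It is not the authors' MultiMR-DCS++ and involves no MapReduce or load balancing; it only illustrates, under simplifying assumptions, how a window can grow while the ratio of detected duplicates to comparisons stays at or above a threshold. The names `is_match`, `adaptive_snm_pass`, `multi_pass_snm`, the parameters `w` and `phi`, the similarity threshold of 0.8, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher


def is_match(a, b, threshold=0.8):
    # Hypothetical matcher: string similarity at or above an assumed threshold.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def adaptive_snm_pass(records, key, w=3, phi=0.5):
    """One pass: sort by `key`, then compare each record with its successors,
    extending the window while duplicates / comparisons >= phi."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        dups = comparisons = 0
        limit = w - 1  # neighbours inspected before any extension
        offset = 1
        while offset <= limit and pos + offset < len(order):
            j = order[pos + offset]
            comparisons += 1
            if is_match(records[i], records[j]):
                dups += 1
                pairs.add((min(i, j), max(i, j)))
            # At the window edge, grow by one slot if the region looks dense.
            if offset == limit and dups / comparisons >= phi:
                limit += 1
            offset += 1
    return pairs


def multi_pass_snm(records, keys, w=3, phi=0.5):
    # Run one adaptive pass per sorting key and merge the detected pairs.
    pairs = set()
    for key in keys:
        pairs |= adaptive_snm_pass(records, key, w, phi)
    return pairs


def full_name(record):
    return record


def last_name(record):
    return record.split()[-1]


records = ["john smith", "jon smith", "mary jones",
           "mary jonse", "alan turing", "zohn smith"]
single = adaptive_snm_pass(records, full_name)
multi = multi_pass_snm(records, [full_name, last_name])
```

In this toy data, sorting by the full string leaves "zohn smith" far from the other "smith" records, so a single pass misses that pair; a second pass sorted by the last token brings them together. This is the motivation for multiple passes, at the cost of extra comparisons that a parallel implementation would need to distribute evenly.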

Journal: Journal of Parallel and Distributed Computing
Publication Date: Nov 14, 2016
Citations: 5

Similar Papers

Adaptive sorted neighborhood blocking for entity matching with MapReduce
Demetrio Gomes Mestre ... Dimas C Nascimento
13 Apr 2015

An efficient spark-based adaptive windowing for entity matching
Demetrio Gomes Mestre ... Andreza Raquel Monteiro De Queiroz
Journal of Systems and Software | VOL. 128
06 Mar 2017

Efficient Entity Matching over Multiple Data Sources with MapReduce
...
Journal of Information and Data Management | VOL. 5
13 Jul 2014

Chapter 8 Context-Based Entity Matching for Big Data
Mayesha Tasnim ... Damien Graux
01 Jan 2020
