Abstract

The execution of data-intensive tasks such as entity matching on large data sources has become a common demand in the era of Big Data. To meet this challenge, cloud computing has proven to be a powerful ally for efficiently parallelizing the execution of such tasks. In this work we investigate how to efficiently perform entity matching over multiple large data sources using the MapReduce programming model. We propose MSBlockSlicer, a MapReduce-based approach that supports blocking techniques to reduce the entity matching search space. The approach uses a preprocessing MapReduce job to analyze the data distribution and improves load balancing by applying an efficient block slicing strategy together with a well-known optimization algorithm to assign the generated match tasks. We evaluate our approach against an existing one that addresses the same problem on a real cloud infrastructure. The results show that our approach significantly improves the performance of distributed entity matching by reducing the amount of data generated in the map phase and minimizing the execution time.
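
To illustrate the blocking and block-slicing ideas summarized above, the sketch below simulates the two MapReduce phases with plain Python functions. It is only an illustrative sketch, not the MSBlockSlicer implementation: the prefix blocking key, the SequenceMatcher similarity measure, the max_block_size threshold, and the slicing scheme are all assumptions, and the paper's preprocessing job and task-assignment optimization are omitted.

```python
# Illustrative sketch of blocking-based entity matching in the MapReduce style,
# simulated with plain Python. NOT the MSBlockSlicer implementation: blocking
# key, similarity function, and max_block_size are assumptions for this demo.
from itertools import combinations
from collections import defaultdict
from difflib import SequenceMatcher


def map_phase(entities, max_block_size=2):
    """Group entities into blocks by a (assumed) blocking key, then slice
    oversized blocks so that no single reduce task gets the whole block."""
    blocks = defaultdict(list)
    for e in entities:
        blocks[e["name"][:3].lower()].append(e)  # assumed prefix blocking key
    tasks = []
    for key, members in blocks.items():
        slices = [members[i:i + max_block_size]
                  for i in range((0), len(members), max_block_size)]
        # Within-slice comparisons: all pairs inside one slice.
        for i, s in enumerate(slices):
            tasks.append(((key, i, i), (s, s)))
        # Cross-slice comparisons: pairs spanning two different slices,
        # so all pairs of the original block are still covered exactly once.
        for i, j in combinations(range(len(slices)), 2):
            tasks.append(((key, i, j), (slices[i], slices[j])))
    return tasks


def reduce_phase(tasks, threshold=0.8):
    """Compare entity pairs inside each match task and keep likely matches."""
    matches = set()
    for (_key, i, j), (left, right) in tasks:
        if i == j:
            pairs = combinations(left, 2)
        else:
            pairs = ((a, b) for a in left for b in right)
        for a, b in pairs:
            sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
            if sim >= threshold:
                matches.add(tuple(sorted((a["id"], b["id"]))))
    return matches


if __name__ == "__main__":
    entities = [
        {"id": 1, "name": "Acme Corp"},
        {"id": 2, "name": "Acme Corp."},
        {"id": 3, "name": "Acme Company"},
        {"id": 4, "name": "Globex Inc"},
        {"id": 5, "name": "Globex Inc."},
    ]
    print(reduce_phase(map_phase(entities)))  # {(1, 2), (4, 5)}
```

In a real Hadoop deployment, map_phase would correspond to mappers emitting (block, slice-pair) keys and reduce_phase to reducers performing the pairwise comparisons; slicing keeps the per-reducer workload bounded, which is the load-balancing effect the abstract refers to.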
