Abstract

Entity Matching (EM), i.e., the task of identifying records that refer to the same real-world entity, is a fundamental problem in information integration and data cleansing systems, e.g., finding similar product descriptions across databases. Owing to its pairwise nature, the EM task is known to be challenging when the datasets involved in the matching process are large. For this reason, studying the challenges and possible solutions of how EM can benefit from modern parallel computing programming models, such as Apache Spark (Spark), has become an important demand (Christen, 2012a; Kolb et al., 2012b). The effectiveness and scalability of Spark-based EM implementations depend on how well the workload is balanced among the workers. In this article, we investigate how Spark can be used to perform efficient, load-balanced parallel EM using a variation of the Sorted Neighborhood Method (SNM) with a varying (adaptive) window size. We propose Spark Duplicate Count Strategy (S-DCS++), a Spark-based approach for adaptive SNM that aims to further improve the performance of this method. Evaluation results based on real-world datasets and a cluster infrastructure show that our approach improves on parallel DCS++ in terms of EM execution time.
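To make the adaptive-window idea behind DCS++ concrete, the following is a minimal single-machine sketch (not the paper's S-DCS++ implementation): records are sorted by a blocking key, a window of initial size `w` slides over the sorted list, and the window is extended while detected duplicates keep the duplicate-to-comparison ratio above a threshold `phi`. The function name `adaptive_snm` and the parameter defaults are illustrative assumptions, not names from the paper.

```python
def adaptive_snm(records, key, is_match, w=3, phi=0.5):
    """Simplified sketch of an adaptive Sorted Neighborhood pass
    in the spirit of the Duplicate Count Strategy (DCS++).

    records  -- iterable of records to deduplicate
    key      -- blocking-key function used to sort the records
    is_match -- pairwise predicate deciding whether two records match
    w        -- initial window size
    phi      -- duplicate/comparison ratio below which the window stops growing
    """
    recs = sorted(records, key=key)
    n = len(recs)
    pairs = set()
    for i in range(n):
        end = min(i + w, n)          # initial window boundary
        duplicates = comparisons = 0
        j = i + 1
        while j < end:
            comparisons += 1
            if is_match(recs[i], recs[j]):
                duplicates += 1
                pairs.add((i, j))
                # Adaptive step: a found duplicate suggests more may
                # follow, so extend the window past the current match.
                end = min(max(end, j + w), n)
            elif duplicates / comparisons < phi:
                # Too few duplicates for the comparisons spent: stop early.
                break
            j += 1
    return pairs
```

On Spark, each partition of the key-sorted dataset would run such a pass locally (e.g., via `mapPartitions`), which is where balanced workload distribution becomes the deciding factor for scalability.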
