Exploring Spark-SQL-Based Entity Resolution Using the Persistence Capability

Xiao Chen, Eike Schallehn, Gunter Saake, Roman Zoun, Sravani Mantha, Kirity Rapuru

doi:10.1007/978-3-319-99987-6_1

Abstract

Entity Resolution (ER) is the task of identifying records that refer to the same real-world entity. A naive way to solve ER tasks is to compute the similarity of every pair in the Cartesian product of all records; this approach, called pair-wise ER, has quadratic time complexity. Faced with exploding data volumes, pair-wise ER struggles to achieve high efficiency and scalability. To tackle this challenge, parallel computing has been proposed to speed up the ER process. Because distributed programming is difficult, big data processing frameworks are often used to ease the realization of parallel ER, as they support data partitioning, workload balancing, and fault tolerance. However, the efficiency and scalability of parallel ER are also influenced by the adopted framework. In the area of parallel ER, the adoption of Apache Spark, a general framework supporting in-memory computation, has not yet been widely studied. Furthermore, although Apache Spark provides both low-level (RDD-based) and high-level (Dataset-based) APIs, to date only the RDD-based APIs have been adopted in parallel ER research. In this paper, we implement a Spark-SQL-based ER process and explore its persistence capability to assess the performance benefits. We evaluate its speedup and compare its efficiency to Spark-RDD-based ER. We observe that different persistence options have a large impact on the efficiency of Spark-SQL-based ER, so the option must be chosen carefully. By adopting the best persistence option, the efficiency of our Spark-SQL-based ER implementation improves by up to 3 times on different datasets, compared with a baseline without any persistence option or with misconfigured persistence.
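To make the quadratic cost of pair-wise ER concrete, the following is a minimal single-machine sketch in plain Python. It is not the paper's Spark implementation: the token-based Jaccard similarity, the 0.5 threshold, the helper names (tokens, jaccard, pairwise_er), and the sample records are all assumptions chosen for illustration.

```python
from itertools import combinations


def tokens(record: str) -> set:
    """Lower-case a record and split it into a set of word tokens."""
    return set(record.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_er(records: list, threshold: float = 0.5) -> list:
    """Compare every unordered pair of records: n * (n - 1) / 2 comparisons."""
    token_sets = [tokens(r) for r in records]  # tokenize once, reuse for every pair
    matches = []
    for i, j in combinations(range(len(records)), 2):
        score = jaccard(token_sets[i], token_sets[j])
        if score >= threshold:  # hypothetical match threshold
            matches.append((i, j, score))
    return matches


# Illustrative records only; not taken from the paper's datasets.
records = [
    "Apache Spark SQL guide",
    "apache spark sql guide",
    "Entity resolution overview",
]
matches = pairwise_er(records)
```

With n records the loop performs n(n - 1)/2 comparisons, which is the growth that motivates parallel ER. Tokenizing each record once and reusing the result for every pair is loosely analogous to keeping an intermediate result in memory rather than recomputing it, although Spark's persistence options involve storage-level choices that this sketch does not model.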
