SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.

Nicholas Kofi Akortia Hagan, John R Talburt

doi:10.3389/fdata.2024.1446071

Abstract

Data volume is among the fastest-growing assets of most real-world applications. This growth increases the rate of human errors such as duplicate records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution (ER) is an ETL process that aims to resolve data inconsistencies by determining which references refer to the same real-world object. One of the main challenges for most traditional Entity Resolution systems is scaling to meet rising data needs. This research refactors a working proof-of-concept entity resolution system, the Data Washing Machine, to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset, and we improve the Data Washing Machine design to use intrinsic metadata information from references. We show that our system achieves the same results as the legacy Data Washing Machine on 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging in size from a few thousand to millions of records. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.

Journal: Frontiers in Big Data
Publication Date: Sep 9, 2024
License type: CC BY 4.0