An Effective Entity Resolution Approach for Big Data

Randa Mohamed Abd El-Ghafar,Eman S Nasr,Ali H El-Bastawissy,Mervat H Gheith

doi:10.35940/ijitee.k9503.09101121

Open Access

https://doi.org/10.35940/ijitee.k9503.09101121

Abstract

Entity Resolution (ER) is defined as the process of identifying records/objects that correspond to the same real-world objects/entities. To define a good ER approach, the schema of the data should be well known. In addition, schema alignment of multiple datasets is not an easy task and may require either a domain expert or a machine learning (ML) algorithm to select which attributes to match. Schema-agnostic blocking tries to solve this problem by considering each token as a blocking key regardless of the attribute it appears in. It may also be coupled with meta-blocking to reduce the number of false negatives. However, it requires an exact match of tokens, which rarely occurs in actual datasets, and it results in very low precision. To overcome such issues, we propose a novel and efficient ER approach for big data implemented in Apache Spark. The proposed approach avoids schema alignment: it treats the attributes as a bag of words and generates a set of n-grams, which is transformed into vectors. The generated vectors are compared using a chosen similarity measure. The proposed approach is generic, as it can accept all types of datasets. It consists of five consecutive sub-modules: 1) Dataset acquisition, 2) Dataset pre-processing, 3) Setting selection criteria, where all settings of the proposed approach are selected, such as the blocking key, the significant attributes, the NLP techniques, the ER threshold, and the ER scenario, 4) ER pipeline construction, and 5) Clustering, where similar records are grouped into the same cluster. The ER pipeline accepts two types of attributes: Weighted Attributes (WA) or Compound Attributes (CA). In addition, it accepts all the settings selected in the third sub-module. The pipeline consists of five phases. Phase 1) Generating the tokens composing the attributes. Phase 2) Generating n-grams of length n. Phase 3) Applying hashing Term Frequency (TF) to convert each set of n-grams to a fixed-length feature vector. Phase 4) Applying Locality Sensitive Hashing (LSH), which maps similar input items to the same buckets with a higher probability than dissimilar input items. Phase 5) Classifying the objects as duplicates or non-duplicates according to the calculated similarity between them. We introduced seven different scenarios as input to the ER pipeline. To minimize the number of comparisons, we proposed the length filter, which greatly improves the effectiveness of the proposed approach: it achieves the highest F-measure with the existing computational resources and scales well with the available working nodes. Three results have been revealed: 1) Using the CA in the different scenarios achieves better results than the single WA in terms of efficiency and effectiveness. 2) Scenarios 3 and 4 achieve the best run time because using Soundex and stemming reduces the processing time of the proposed approach. 3) Scenario 7 achieves the highest F-measure because, by utilizing the length filter, we only compare records whose string lengths lie within a pre-determined percentage of increase or decrease of each other. LSH maps similar input items to the same buckets with a higher probability than dissimilar ones. It takes numHashTables as a parameter. Increasing the number of candidate pairs with the same numHashTables will reduce the accuracy of the model. Utilizing the length filter helps to minimize the number of candidates, which in turn increases the accuracy of the approach.
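The five pipeline phases and the length filter described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' Spark implementation: it assumes character 3-grams, a MinHash-style LSH with one hash function per table, and Jaccard similarity, and every name and value in it (`NUM_FEATURES`, `NUM_HASH_TABLES`, `length_ok`, the 30% length tolerance, the 0.5 threshold) is hypothetical and chosen only for the example.

```python
import hashlib
import re
from itertools import combinations

NUM_FEATURES = 1 << 12   # assumed size of the hashed feature space
NUM_HASH_TABLES = 4      # plays the role of numHashTables (assumed value)


def stable_hash(text, seed=0):
    """Deterministic hash; Python's built-in hash() is salted per process."""
    digest = hashlib.md5(f"{seed}:{text}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def tokenize(record):
    # Phase 1: treat all attribute values as one bag of lowercase tokens.
    return re.findall(r"[a-z0-9]+", " ".join(record).lower())


def char_ngrams(tokens, n=3):
    # Phase 2: character n-grams of length n over the joined tokens.
    text = " ".join(tokens)
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]


def hashing_tf(ngrams):
    # Phase 3: hashing trick, giving a sparse vector {feature index: count}.
    vec = {}
    for gram in ngrams:
        idx = stable_hash(gram) % NUM_FEATURES
        vec[idx] = vec.get(idx, 0) + 1
    return vec


def minhash_signature(vec):
    # One min-hash value per hash table over the non-zero feature indices.
    return tuple(min(stable_hash(str(i), seed) for i in vec)
                 for seed in range(1, NUM_HASH_TABLES + 1))


def jaccard(vec_a, vec_b):
    a, b = set(vec_a), set(vec_b)
    return len(a & b) / len(a | b)


def length_ok(text_a, text_b, tolerance=0.3):
    # Length filter: keep pairs whose lengths differ by at most `tolerance`.
    la, lb = len(text_a), len(text_b)
    return abs(la - lb) <= tolerance * max(la, lb)


def resolve(records, threshold=0.5):
    tokens = [tokenize(r) for r in records]
    texts = [" ".join(t) for t in tokens]
    vectors = [hashing_tf(char_ngrams(t)) for t in tokens]

    # Phase 4: records sharing a bucket in any table become candidates.
    buckets = {}
    for rid, vec in enumerate(vectors):
        for table, value in enumerate(minhash_signature(vec)):
            buckets.setdefault((table, value), []).append(rid)
    candidates = set()
    for ids in buckets.values():
        candidates.update(combinations(sorted(ids), 2))

    # Phase 5: classify candidates, skipping pairs the length filter rejects.
    matches = []
    for i, j in sorted(candidates):
        if not length_ok(texts[i], texts[j]):
            continue
        sim = jaccard(vectors[i], vectors[j])
        if sim >= threshold:
            matches.append((i, j, round(sim, 2)))
    return matches
```

Calling `resolve` on a list of attribute tuples returns index pairs with their similarity. In this sketch, raising `NUM_HASH_TABLES` yields more candidate pairs, and the length filter prunes some of them before the similarity is computed.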

Highlights

The representation of a real-world object is called a profile or an entity
When the entities arise from two diverse data sources (i.e., ε = ε1 ∪ ε2) and each source is free of duplicates, the problem is called Clean-Clean ER; otherwise, when the entities originate from the same dataset that has duplicates, the problem is called Dirty ER [1]
We proposed an effective and approximate ER approach for big data implemented in Apache Spark
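
The distinction between Clean-Clean ER and Dirty ER in the highlights above affects which record pairs ever need to be compared. The following sketch, with hypothetical function names, enumerates the brute-force comparison space in each setting; it is an illustration of the definitions, not part of the proposed approach.

```python
from itertools import combinations, product


def dirty_er_pairs(entities):
    # Dirty ER: one dataset with duplicates, so every pair inside it is a candidate.
    return list(combinations(entities, 2))


def clean_clean_er_pairs(source_1, source_2):
    # Clean-Clean ER: each source is duplicate-free, so only cross-source pairs matter.
    return list(product(source_1, source_2))
```

For four entities in a single dirty dataset this gives 6 pairs, while two clean sources of two entities each give only 4 cross-source pairs.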

Summary

INTRODUCTION

The representation of a real-world object is called a profile or an entity. An entity profile is composed of a unique identifier and a set of (name, value) pairs.

2) Blocking: Traditional ER approaches apply matching techniques to the Cartesian product of n input entities. These approaches have a complexity of O(n²), which sharply increases the execution time for big datasets.

Schema-agnostic blocking requires an exact match of tokens, which rarely occurs in real-world datasets, and it results in very low precision. To overcome such issues, we proposed an effective and approximate ER approach for big data implemented in Apache Spark. Utilizing Apache Spark helps to address many limitations of the MapReduce framework, such as the data skew problem, which arises from unequal block sizes and produces unbalanced sets of entity pairs that in turn lead to severe imbalances in the reduce phase.

6) Proposing seven different scenarios using different NLP techniques and evaluating them to choose the most efficient and effective one.
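
To make the blocking step and the exact-match limitation concrete, here is a small sketch of schema-agnostic token blocking over profiles stored as (name, value) pairs. The function name and the sample profiles are assumptions made for illustration only.

```python
from itertools import combinations


def token_blocking(profiles):
    """profiles maps an identifier to a dict of (name, value) pairs."""
    blocks = {}
    for pid, attrs in profiles.items():
        for value in attrs.values():
            # Every token is a blocking key, whatever attribute it came from.
            for token in str(value).lower().split():
                blocks.setdefault(token, set()).add(pid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


profiles = {
    1: {"name": "John Smith", "city": "Cairo"},
    2: {"full_name": "Jon Smith", "location": "Cairo"},
    3: {"name": "Jhon Smyth", "town": "Giza"},   # misspelled variant of profile 1
    4: {"title": "Alice Brown", "city": "Giza"},
}
candidate_pairs = token_blocking(profiles)
```

The Cartesian product of these four profiles has 6 pairs, while token blocking keeps only (1, 2) and (3, 4). Profile 3 never meets profile 1 because the two share no exact token, and (3, 4) is compared only because both mention "Giza", which shows both the missed match and the superfluous comparison.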

RELATED WORK

THE PROPOSED ENTITY RESOLUTION APPROACH

Entity Resolution Pipeline Module

Clustering Module
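
The abstract describes this module as grouping similar records into the same cluster. One common way to do that, assumed here and not confirmed as the authors' method, is to take the connected components of the matched pairs with a union-find structure. The function name below is hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    # Each record starts in its own cluster.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every pair classified as duplicates.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())
```

With records 1 to 5 and matched pairs (1, 2) and (2, 3), records 1, 2, and 3 end up in one cluster and records 4 and 5 stay as singletons.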

Implementing the proposed seven scenarios using multi-nodes

Findings

CONCLUSION AND FUTURE WORK

Journal: International Journal of Innovative Technology and Exploring Engineering
Publication Date: Sep 30, 2021
Citations: 1
License type: cc-by
