Generating Name-Like Vectors for Testing Large-Scale Entity Resolution

Samudra Herath, Gary Glonek, Matthew Roughan

doi:10.1109/access.2021.3122451

Abstract

Entity resolution (ER), the problem of identifying and linking records that belong to the same real-world entities in structured and unstructured data, is a primary task in data integration. Accurate and efficient ER has a major practical impact on applications across commercial, security and scientific domains. Recently, scalable ER techniques have received enormous attention with the increasing need to combine large-scale datasets. The shortage of training and ground truth data impedes the development and testing of ER algorithms. Good public datasets, especially those containing personal information, are restricted in this area and usually small. Because of privacy and confidentiality concerns, testing algorithms or techniques with real datasets is challenging in ER research. Simulation is one technique for generating synthetic datasets whose characteristics are similar to those of real data, so that algorithms can be tested on them. Many existing simulation tools in ER lack support for generating large-scale data and suffer from complexity, poor scalability, and the limitations of resampling. In our work, we propose a simple, inexpensive, and fast synthetic data generation tool. In its first stage, our tool generates only entity names, but these are commonly used as identification keys in ER algorithms. We avoid the detail-level simulation of entity names by using a simple vector representation that delivers simplicity and efficiency. In this paper, we discuss how to simulate simple vectors that approximate the properties of entity names. We describe the overall construction of the tool based on data analysis of a namespace that contains entity names collected from the actual environment.

Highlights

Data integration plays a vital role in data analysis and mining projects by combining data from different sources into meaningful information
We developed our simulation model following data analysis of a namespace that contains entity names collected from the actual environment
Based on the results of the data analysis, we propose a numerical simulation model that generates name-like vectors
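To make the idea of testing ER on simulated name-like vectors concrete, the following is a minimal sketch and not the authors' model. Everything specific in it is an assumption for illustration: the function names, the i.i.d. standard normal vectors, the Gaussian perturbation that stands in for name variation, and the nearest-neighbour matcher. It only shows why simulated data is useful here: every record carries its true entity id, so a matcher can be scored directly.

```python
import random


def simulate_entities(n_entities, dim, seed=0):
    """Draw one reference vector per entity.

    Assumption: coordinates are i.i.d. standard normal; the paper's
    fitted model is not reproduced here.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_entities)]


def make_noisy_records(entities, copies, noise_sd, seed=1):
    """Create `copies` perturbed records per entity, keeping the true entity id."""
    rng = random.Random(seed)
    records = []
    for entity_id, vec in enumerate(entities):
        for _ in range(copies):
            noisy = [x + rng.gauss(0.0, noise_sd) for x in vec]
            records.append((entity_id, noisy))
    return records


def nearest_entity(vec, entities):
    """Toy matcher: index of the closest reference vector (squared Euclidean)."""
    return min(
        range(len(entities)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, entities[i])),
    )


def matcher_accuracy(records, entities):
    """Fraction of records assigned to their true entity."""
    hits = sum(nearest_entity(vec, entities) == eid for eid, vec in records)
    return hits / len(records)


if __name__ == "__main__":
    entities = simulate_entities(n_entities=200, dim=8)
    records = make_noisy_records(entities, copies=3, noise_sd=0.05)
    print(f"records: {len(records)}")
    print(f"accuracy: {matcher_accuracy(records, entities):.3f}")
```

Because the ground truth is generated alongside the records, the same harness can be rerun with larger `n_entities` or a larger `noise_sd` to see how a matcher degrades, without touching any personal data.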

Summary

Introduction

Data integration plays a vital role in data analysis and mining projects by combining data from different sources into meaningful information. Entity resolution (ER), a core step in data integration, detects entity records across multiple databases that correspond to the same real-world entity. ER has been widely recognised in academic and statistical research, since research data are gathered from multiple data sources that store data in different formats. This process is of increasing importance in commercial and government practice. Vatsalan et al. [8] presented a survey of existing techniques that match and link databases between organizations while considering the privacy aspects of the data. Christophides et al. [9] reviewed ER techniques in the context of big data, whereas Barlaug et al. [10] provided an up-to-date survey of deep neural networks in entity matching. We survey only a few relevant works that align with the focus of our work.

Journal: IEEE Access
Publication Date: Jan 1, 2021
License type: CC BY 4.0
