Abstract

Entity resolution (ER), the problem of identifying and linking records that refer to the same real-world entity in structured and unstructured data, is a primary task in data integration. Accurate and efficient ER has a major practical impact on applications across commercial, security, and scientific domains. Scalable ER techniques have recently received considerable attention because of the increasing need to combine large-scale datasets. However, the shortage of training and ground-truth data impedes the development and testing of ER algorithms. Good public datasets, especially those containing personal information, are hard to obtain in this area and are usually small. Because of privacy and confidentiality concerns, testing algorithms or techniques on real datasets is challenging in ER research. Simulation is one technique for generating synthetic datasets whose characteristics resemble those of real data, so that algorithms can be tested on them. Many existing simulation tools for ER do not support large-scale data generation and suffer from high complexity, poor scalability, and the limitations of resampling. In this work, we propose a simple, inexpensive, and fast synthetic data generation tool. In its first stage, the tool generates only entity names, which are commonly used as identification keys in ER algorithms. Rather than simulating entity names in full detail, we use a simple vector representation, which keeps the tool simple and efficient. In this paper, we discuss how to simulate simple vectors that approximate the properties of entity names. We describe the overall construction of the tool, which is based on data analysis of a namespace containing entity names collected from a real-world environment.

Highlights

  • Data integration plays a vital role in data analysis and mining projects by combining data from different sources into meaningful information

  • We developed our simulation model following data analysis of a namespace that contains entity names collected from the actual environment

  • Based on the results of the data analysis, we propose a numerical simulation model that generates name-like vectors


Summary

Introduction

Data integration plays a vital role in data analysis and mining projects by combining data from different sources into meaningful information. Entity resolution (ER), a core step in data integration, detects records across multiple databases that correspond to the same real-world entity. ER is widely recognised in academic and statistical research because research data are gathered from multiple sources that store data in different formats. It is also of increasing importance in commercial and government practice. Vatsalan et al. [8] surveyed existing techniques for matching and linking databases between organisations while taking the privacy of the data into account. Christophides et al. [9] reviewed ER techniques in the context of big data, whereas Barlaug et al. [10] provided an up-to-date survey of deep neural networks in entity matching. We survey only a few relevant works that align with the focus of our study.

