Entity Resolution Research Articles

Background: Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science, and a common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing these authentic data have slowed the development, testing, and adoption of new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets approaching the scale and complexity of the data on which large organizations and federal agencies, like the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.

Methods: We created the simulated population data available through pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.

Results: Our simulation process produced over 900 gigabytes of simulated census, survey, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional affordances for adding various kinds of noise to the data to provide realistic, sharable challenges for ER researchers.
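As a rough illustration of why added noise makes simulated records a realistic ER challenge, the sketch below corrupts name fields with random character substitutions. It is plain Python and does not use the pseudopeople API; the function `add_typo_noise`, the field names, and the noise rate are all hypothetical choices made for this example.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_typo_noise(record, fields, rate, rng):
    """Return a copy of `record` in which each alphabetic character of the
    listed fields is replaced by a random lowercase letter with
    probability `rate`. Other fields are copied unchanged."""
    noisy = dict(record)  # shallow copy so the clean record is not mutated
    for field in fields:
        chars = []
        for ch in record[field]:
            if ch.isalpha() and rng.random() < rate:
                chars.append(rng.choice(ALPHABET))
            else:
                chars.append(ch)
        noisy[field] = "".join(chars)
    return noisy

# Hypothetical simulated record; not drawn from any real dataset.
clean = {
    "first_name": "maria",
    "last_name": "gonzalez",
    "date_of_birth": "1984-03-17",
}

rng = random.Random(0)  # seeded so repeated runs give the same noise
noisy = add_typo_noise(clean, ["first_name", "last_name"], rate=0.2, rng=rng)
print(clean)
print(noisy)
```

An ER method then has to link the noisy record back to the clean one even though exact string equality on the name fields may no longer hold.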

Traditional data curation processes typically depend on human intervention. As data volume and variety grow exponentially, organizations are striving to increase the efficiency of their data processes by automating manual steps and making them as unsupervised as possible. An additional challenge is to make these unsupervised processes scalable enough to meet the demands of increased data volume. This paper describes the parallelization of an unsupervised entity resolution (ER) process. ER is a component of many different data curation processes because it clusters records from multiple data sources that refer to the same real-world entity, such as the same customer, patient, or product. The ability to scale ER processes is particularly important because the computational effort of ER increases quadratically with data volume. The Data Washing Machine (DWM) is a previously proposed unsupervised ER system that clusters references from diverse data sources. The DWM uses an entropy measure to self-evaluate the quality of record clustering. The current single-threaded implementations of the DWM in Python and Java do not scale beyond a few thousand records and rely on large, shared memory. The objective of this research is to address the two major shortcomings of the current DWM design, namely its creation and use of shared memory and its lack of scalability, by leveraging Hadoop MapReduce. We propose the Hadoop Data Washing Machine (HDWM), a MapReduce implementation of the legacy DWM. Although HDWM targets an unsupervised system, the proposed parallelization method can be applied both to supervised systems, where matching rules are created by experts, and to unsupervised systems, where expert intervention is not required. The scalability of the proposed system is demonstrated using publicly available ER datasets. Based on results from our experiment, we conclude that HDWM can cluster from thousands to millions of equivalent references using multiple computational nodes with independent RAM and CPU cores.
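The quadratic growth in ER effort noted in the abstract above follows from the number of record pairs, n(n-1)/2. The sketch below shows that count and a toy map/reduce-style pass in which a map step emits a blocking key per record and a reduce step compares records only within each key group. It is a plain Python illustration, not the HDWM implementation; the blocking key (first letter of the name) and the function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

def pair_count(n):
    """Number of unordered record pairs among n records."""
    return n * (n - 1) // 2

def map_phase(records, key_fn):
    """Emit (blocking_key, record_id) for every record."""
    for rec_id, value in records:
        yield key_fn(value), rec_id

def reduce_phase(keyed):
    """Group record ids by key and list candidate pairs within each group."""
    blocks = defaultdict(list)
    for key, rec_id in keyed:
        blocks[key].append(rec_id)
    pairs = []
    for key in sorted(blocks):
        pairs.extend(combinations(sorted(blocks[key]), 2))
    return pairs

# Hypothetical references; ids and names are made up for illustration.
records = [
    (1, "smith john"),
    (2, "smyth john"),
    (3, "jones mary"),
    (4, "jonas mary"),
    (5, "adams lee"),
]

candidate_pairs = reduce_phase(map_phase(records, key_fn=lambda v: v[0]))
print(pair_count(len(records)))  # 10 pairs without blocking
print(candidate_pairs)           # [(3, 4), (1, 2)] with blocking
```

Because each key group can be reduced independently, the groups could in principle be handled by separate nodes with their own memory, which is the property a MapReduce design relies on. Doubling the input from 1,000 to 2,000 records raises the unblocked pair count from 499,500 to 1,999,000.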

Articles published on Entity Resolution

T-KAER: Transparency-enhanced Knowledge-Augmented Entity Resolution Framework

Making It Tractable to Detect and Correct Errors in Graphs

Pre-trained models for linking process in data washing machine

Exploring other clustering methods and the role of Shannon Entropy in an unsupervised setting

The use of Supervised Learning to perform pairwise classification for Record Linkage over real world data

SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.

PyJedAI: A Library with Resolution-Related Structures and Procedures for Products

Code and Data Repository for pyJedAI: A Library with Resolution-Related Structures and Procedures for Products

Cost-efficient prompt engineering for unsupervised entity resolution in the product matching domain

PANNER: POS-Aware Nested Named Entity Recognition Through Heterogeneous Graph Neural Network

Graph Association Analyses for Early Drug Discovery

Rock: Cleaning Data with both ML and Logic Rules

Low-resource entity resolution with domain generalization and active learning

An efficient approach for discovering Graph Entity Dependencies (GEDs)

Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling

Enhancing Entity Resolution with a hybrid Active Machine Learning framework: Strategies for optimal learning in sparse datasets

Convergence Diagnostics for Entity Resolution

ERABQS: entity resolution based on active machine learning and balancing query strategy

Multipartite Entity Resolution: Motivating a K-Tuple Perspective (Student Abstract)

A scalable MapReduce-based design of an unsupervised entity resolution system.
