Synthesizing Privacy Preserving Entity Resolution Datasets

Xuedi Qin, Guoliang Li, Nan Tang, Yaoyu Zhu, Jian Li, Yuyu Luo, Chengliang Chai

doi:10.1109/icde53745.2022.00222

Abstract

Entity resolution (ER) is a core problem in data integration. Many companies hold large numbers of datasets on which ER must be conducted to integrate the data. On the one hand, it is nontrivial for non-ER experts within companies to design ER solutions. On the other hand, most companies are reluctant to release their real datasets for multiple reasons (e.g., privacy issues). A typical solution from the machine learning (ML) and statistics communities is to create surrogate (a.k.a. analogous) datasets based on the real dataset and release these surrogate datasets to the public to train ML models, so that the models trained on the surrogate datasets can either be used directly or be adapted to the real dataset by the companies. In this paper, we study a new problem of synthesizing surrogate ER datasets using transformer models, with the goal that the ER model trained on the synthesized dataset can be used directly on the real dataset. We propose privacy preserving methods to synthesize ER datasets: we first learn the true similarity distributions of both matching and non-matching entity pairs from the real dataset. We then devise algorithms that satisfy differential privacy and can synthesize fake but semantically meaningful entities, add matching and non-matching labels to these fake entity pairs, and ensure that the fake and real datasets have similar distributions. We also describe a method for entity rejection to avoid synthesizing bad fake entities that may destroy the original distributions. Extensive experiments show that ER matchers trained on real and synthetic ER datasets have very close performance on the same test sets: their F1 scores are within 6% of each other on 3 commonly used ER datasets, and their average precision and recall differences are less than 5%.

Index Terms: Data Synthesis, Entity Resolution
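To make the idea of a privately learned similarity distribution more concrete, the following is a minimal sketch and not the paper's algorithm. It assumes token-level Jaccard similarity, a fixed-width histogram, and the standard Laplace mechanism with sensitivity 1 (that is, neighboring datasets differ by one entity pair; if one entity appears in many pairs, the sensitivity would be higher). The rejection rule in `accept_pair` is likewise a simple stand-in for illustration. All function names and the toy pairs are hypothetical.

```python
import random


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two entity strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(sims, bins, epsilon, rng):
    """Normalized similarity histogram with Laplace noise on each count.

    Assumption: each pair changes exactly one count by 1 (sensitivity 1).
    """
    counts = [0] * bins
    for s in sims:
        counts[min(int(s * bins), bins - 1)] += 1
    # Add noise, then clip negatives so the result can be normalized.
    noisy = [max(c + laplace_noise(1.0 / epsilon, rng), 0.0) for c in counts]
    total = sum(noisy)
    if total == 0.0:
        return [1.0 / bins] * bins
    return [c / total for c in noisy]


def accept_pair(sim, target_hist, rng):
    """Illustrative rejection rule: keep a synthetic pair with probability
    proportional to the target mass of the bin its similarity falls into."""
    bins = len(target_hist)
    idx = min(int(sim * bins), bins - 1)
    return rng.random() < target_hist[idx] / max(target_hist)


# Toy matching pairs, invented purely for illustration.
rng = random.Random(0)
matching_pairs = [
    ("apple iphone 12", "apple iphone 12 black"),
    ("sony tv 55", "sony 55 tv"),
]
sims = [jaccard(a, b) for a, b in matching_pairs]
hist = private_histogram(sims, bins=5, epsilon=1.0, rng=rng)
keep = accept_pair(jaccard("apple iphone 12", "apple iphone 13"), hist, rng)
```

In this sketch a smaller `epsilon` adds more noise to each bin, which trades fidelity of the learned distribution for stronger privacy; a separate histogram would be built the same way for non-matching pairs.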

