Abstract

AI-based data synthesis has seen rapid progress over the last several years and is increasingly recognized for its promise to enable privacy-respecting, high-fidelity data sharing. This is reflected by the growing availability of both commercial and open-source software solutions for synthesizing private data. However, despite these recent advances, adequately evaluating the quality of generated synthetic datasets remains an open challenge. We aim to close this gap and introduce a novel holdout-based empirical assessment framework for quantifying the fidelity as well as the privacy risk of synthetic data solutions for mixed-type tabular data. Fidelity is measured via statistical distances of lower-dimensional marginal distributions, which provide a model-free and easy-to-communicate empirical metric for the representativeness of a synthetic dataset. Privacy risk is assessed by calculating individual-level distances to the closest record with respect to the training data. By showing that the synthetic samples are just as close to the training data as to the holdout data, we obtain strong evidence that the synthesizer indeed learned to generalize patterns and is independent of individual training records. We empirically demonstrate the presented framework for seven distinct synthetic data solutions across four mixed-type datasets and then compare these to traditional data perturbation techniques. Both a Python-based implementation of the proposed metrics and the demonstration study setup are made available open source. The results highlight the need to systematically assess the fidelity as well as the privacy of this emerging class of synthetic data generators.
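To make the two metrics concrete, the sketch below illustrates, in plain Python with NumPy and pandas, how one might compute (a) an average total variation distance between univariate marginals as a fidelity measure and (b) distances to the closest record (DCR) against a reference set, to be compared between training and holdout data. This is a minimal illustration under our own simplifying assumptions (univariate marginals only, equal-width binning of numeric columns, a simple attribute-mismatch distance) and is not the authors' released implementation.

```python
# Minimal sketch of the two assessment ideas described in the abstract.
# The column handling, binning choice, and the attribute-mismatch distance
# are simplifying assumptions, not the paper's released metrics.

import numpy as np
import pandas as pd


def marginal_tv_distance(real: pd.DataFrame, synthetic: pd.DataFrame, bins: int = 10) -> float:
    """Average total variation distance across univariate marginal distributions."""
    distances = []
    for col in real.columns:
        if pd.api.types.is_numeric_dtype(real[col]):
            # Discretize numeric columns on a common grid derived from the real data.
            edges = np.histogram_bin_edges(real[col].dropna(), bins=bins)
            p = np.histogram(real[col].dropna(), bins=edges)[0].astype(float)
            q = np.histogram(synthetic[col].dropna(), bins=edges)[0].astype(float)
        else:
            # Align categorical frequencies on the categories observed in the real data.
            counts = real[col].astype(str).value_counts()
            p = counts.to_numpy(dtype=float)
            q = (synthetic[col].astype(str).value_counts()
                 .reindex(counts.index, fill_value=0).to_numpy(dtype=float))
        p /= max(p.sum(), 1.0)
        q /= max(q.sum(), 1.0)
        distances.append(0.5 * np.abs(p - q).sum())  # total variation distance in [0, 1]
    return float(np.mean(distances))


def distance_to_closest_record(synthetic: pd.DataFrame, reference: pd.DataFrame) -> np.ndarray:
    """For each synthetic record, the share of mismatching attributes to its closest reference record."""
    syn = synthetic.astype(str).to_numpy()
    ref = reference.astype(str).to_numpy()
    return np.array([np.min((ref != row).mean(axis=1)) for row in syn])
```

Under this setup, fidelity improves as the average marginal distance to a holdout set shrinks, while the privacy argument of the paper corresponds to the distribution of distance_to_closest_record(synthetic, training) being no smaller than that of distance_to_closest_record(synthetic, holdout).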

Highlights

  • Self-supervised generative AI has made significant progress over the past years, with algorithms capable of creating “shockingly” realistic synthetic data across a wide range of domains

  • These advances are remarkable considering that they do not build upon our own human understanding of the world, but “merely” require a flexible, scalable self-supervised learning algorithm that teaches itself to create novel records based on a sufficient amount of training data

  • In this paper we introduce and empirically demonstrate a novel, flexible, and easy-to-use framework for measuring the fidelity as well as the privacy risk entailed in synthetic data in a mixed-type tabular data setting

Summary

INTRODUCTION

Self-supervised generative AI has made significant progress over the past years, with algorithms capable of creating “shockingly” realistic synthetic data across a wide range of domains. Similar progress has been made within structured data domains, such as synthesizing medical health records (Choi et al., 2017; Goncalves et al., 2020; Krauland et al., 2020), census data (Freiman et al., 2017), human genomes (Yelmen et al., 2021), website traffic (Lin et al., 2020), or financial transactions (Assefa, 2020). These advances are remarkable considering that they do not build upon our own human understanding of the world, but “merely” require a flexible, scalable self-supervised learning algorithm that teaches itself to create novel records based on a sufficient amount of training data. This will allow us to compare the performance of generative models from the rapidly growing field of synthetic data approaches against each other, as well as against alternative statistical disclosure control (SDC) techniques, in the section Empirical Demonstration.

RELATED WORK
FRAMEWORK
Fidelity
Privacy
EMPIRICAL DEMONSTRATION
Findings
DISCUSSION AND FUTURE