On the Quality of Synthetic Generated Tabular Data

Erica Espinosa, Alvaro Figueira

doi:10.3390/math11153278

Abstract

Class imbalance is a common issue when developing classification models. To tackle this problem, synthetic data have recently been developed to augment the minority class: artificially generated samples bolster its representation. However, evaluating the suitability of such generated data is crucial to ensure that they align with the original data distribution. Utility measures serve this purpose by quantifying how similar the distribution of the generated data is to the original one. For tabular data, there are various evaluation methods that assess different characteristics of the generated data. In this study, we collected utility measures and categorized them based on the type of analysis they perform. We then applied these measures to synthetic data generated from two well-known datasets, Adults Income and Liar+. We used five well-known generative models, Borderline SMOTE, DataSynthesizer, CTGAN, CopulaGAN, and REaLTabFormer, to generate the synthetic data and evaluated their quality using the utility measures. The measurements proved informative: they indicate that if one synthetic dataset is superior to another in terms of utility measures, it will be more effective as an augmentation for the minority class in classification tasks.
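To make the idea of a utility measure concrete, the following is a minimal sketch of two common distribution-similarity scores for tabular columns: the two-sample Kolmogorov-Smirnov statistic for a numeric column and the total variation distance for a categorical column. The choice of these two measures, the function names, and the toy values are assumptions made for illustration; the abstract does not state which measures the study collected.

```python
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic for one numeric column:
    the largest gap between the empirical CDFs of the two samples."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def total_variation(real, synthetic):
    """Total variation distance for one categorical column:
    half the summed absolute difference in category frequencies."""
    freq_real = Counter(real)
    freq_synth = Counter(synthetic)
    categories = set(freq_real) | set(freq_synth)
    return 0.5 * sum(
        abs(freq_real[c] / len(real) - freq_synth[c] / len(synthetic))
        for c in categories
    )


# Hypothetical toy columns (not taken from the paper's datasets).
real_age = [25, 32, 47, 51, 62]
synth_close = [26, 33, 45, 52, 60]   # roughly follows the real column
synth_far = [18, 19, 20, 21, 22]     # does not overlap the real column

ks_close = ks_statistic(real_age, synth_close)
ks_far = ks_statistic(real_age, synth_far)

real_job = ["a", "a", "b", "b"]
synth_job = ["a", "a", "a", "b"]
tv_job = total_variation(real_job, synth_job)

print(ks_close, ks_far, tv_job)
```

Both scores lie between 0 and 1, with lower values meaning the synthetic column is closer to the real one. Under this reading, `synth_close` would be preferred over `synth_far` as augmentation data for the numeric column.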


Journal: Mathematics
Publication Date: Jul 26, 2023
Citations: 4
License type: CC BY 4.0


Similar Papers

Generating synthetic personal health data using conditional generative adversarial networks combining with differential privacy
Chang Sun ... Michel Dumontier
Journal of Biomedical Informatics | VOL. 143
01 Jun 2023

A Comparative Study of Synthetic Over-sampling Method to Improve the Classification of Poor Households in Yogyakarta Province
B Santoso ... H Wijayanto
IOP Conference Series: Earth and Environmental Science | VOL. 187
01 Nov 2018

Generation of Synthetic Tabular Healthcare Data Using Generative Adversarial Networks
Alireza Hossein Zadeh Nik ... Michael A Riegler
01 Jan 2023

OBGAN: Minority oversampling near borderline with generative adversarial networks
Wonkeun Jo ... Dongil Kim
Expert Systems with Applications | VOL. 197
26 Feb 2022
