Abstract

Real-world datasets often contain missing values, which compromises the quality of data analyses. Missing values can be replaced with plausible synthetic values (a technique known as missing data imputation), which avoids deleting incomplete observations. Several data imputation methods have been proposed, and generative methods based on Artificial Neural Networks (ANN) are successful alternatives to discriminative methods. In this extended version of our work presented at the International Conference on Computational Science (Neves et al., 2021), we propose three novel data imputation methods based on Generative Adversarial Networks (GAN): SGAIN, WSGAIN-CP, and WSGAIN-GP.

We further studied how data imputation methods can be used to generate fully synthetic datasets. Among other benefits, the generation of synthetic data can help to mitigate legal, ethical, and data privacy issues, as well as to augment original data. In this context, we introduce tabulator, a novel meta-method for synthetic data generation that uses the data imputation methods as back-end engines for tabular data generation.

We evaluated our data imputation methods using datasets with different amputation rates following the Missing Completely At Random (MCAR) setting. The results show that our methods are on par with or outperform state-of-the-art imputation methods in terms of response time and the quality of the imputed data. We further evaluated our data generation methods, which were derived from tabulator, and compared them with a state-of-the-art approach, the Conditional Tabular GAN (CTGAN). The evaluation results show that our tabulator methods outperform CTGAN in many cases, for example regarding the accuracy of machine learning tasks (e.g., prediction or classification) performed on the synthetic output data.
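
To make the MCAR evaluation setting concrete, the following is a minimal sketch of how a complete dataset might be amputed at a given rate and how an imputation could then be scored on the removed entries. It is not the authors' code: the function names (`ampute_mcar`, `mean_impute`, `rmse_on_missing`) are hypothetical, the column-mean imputer is only a simple baseline standing in for SGAIN, WSGAIN-CP, or WSGAIN-GP, and RMSE is assumed here as one possible quality measure.

```python
import numpy as np


def ampute_mcar(data, miss_rate, seed=None):
    """Return a copy of `data` with cells set to NaN completely at random,
    together with a boolean mask (True = observed, False = removed)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # MCAR: every cell is dropped independently with probability `miss_rate`,
    # regardless of its own value or the values of other cells.
    mask = rng.random(data.shape) >= miss_rate
    amputed = np.where(mask, data, np.nan)
    return amputed, mask


def mean_impute(amputed):
    """Baseline imputer (assumption, not a GAN): fill each column's NaNs
    with the mean of that column's observed values."""
    col_means = np.nanmean(amputed, axis=0)
    return np.where(np.isnan(amputed), col_means, amputed)


def rmse_on_missing(original, imputed, mask):
    """Root mean squared error computed only over the removed cells."""
    missing = ~mask
    diff = original[missing] - imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    complete = rng.normal(size=(500, 4))  # toy stand-in for a real dataset
    for rate in (0.2, 0.5, 0.8):          # example amputation rates
        amputed, mask = ampute_mcar(complete, rate, seed=1)
        imputed = mean_impute(amputed)
        print(rate, rmse_on_missing(complete, imputed, mask))
```

Scoring only the removed cells keeps the comparison fair, since the observed cells are copied through unchanged by any imputer.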

