GenerativeMTD: A deep synthetic data generation framework for small datasets

Karthik Ramamurthy, Daehan Won, Menaka Radhakrishnan, Jayanth Sivakumar

doi:10.1016/j.knosys.2023.110956

Abstract

Synthetic data generation for tabular data, unlike for computer vision, is an emerging challenge. Tabular data that needs to be synthesized typically either suffers from a small dataset problem or would violate privacy because it contains sensitive information. When the data is small, any data-driven modeling leads to biased decision making. On the other hand, deep learning models that use small datasets for training are limited. Tabular data also poses a myriad of challenges, such as mixed data types, fidelity, and mode collapse. To overcome small dataset problems and extend deep learning capabilities to small data, this research proposes a new generative method, GenerativeMTD. The method generates fake data by using pseudo-real data as input during training. Pseudo-real data serves the purpose of training the deep learning model with a large number of samples when the real dataset is small. The pseudo-real data is generated from the real data through k-nearest neighbor mega-trend diffusion. This pseudo-real data is then translated into synthetic data that is realistic and similar to the real data. The method outperforms some of the state-of-the-art methods for tabular data generation. The proposed method also generates quality synthetic data for the benchmark datasets in terms of pairwise correlation difference. In addition, the method surpasses the benchmark models in terms of two distance-based privacy metrics: distance to the closest record and nearest neighbor distance ratio.
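
The two distance-based privacy metrics named in the abstract can be illustrated with a minimal sketch. It assumes the commonly used definitions: distance to the closest record (DCR) is the Euclidean distance from a synthetic record to its nearest real record, and nearest neighbor distance ratio (NNDR) is that distance divided by the distance to the second-nearest real record. The paper's exact formulation, distance function, and preprocessing may differ. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math


def nearest_two_distances(record, real_rows):
    """Return the two smallest Euclidean distances from record to real_rows."""
    dists = sorted(math.dist(record, r) for r in real_rows)
    return dists[0], dists[1]


def dcr(synthetic_rows, real_rows):
    """Distance to the closest real record, one value per synthetic record."""
    return [nearest_two_distances(s, real_rows)[0] for s in synthetic_rows]


def nndr(synthetic_rows, real_rows):
    """Nearest-to-second-nearest distance ratio, one value per synthetic record."""
    ratios = []
    for s in synthetic_rows:
        d1, d2 = nearest_two_distances(s, real_rows)
        # Assumption for this sketch: if both distances are zero, report 0.0.
        ratios.append(d1 / d2 if d2 > 0 else 0.0)
    return ratios


# Toy numeric data (hypothetical); real use would need scaled, encoded features.
real = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
synthetic = [(0.0, 1.0), (3.0, 4.0)]

dcr_values = dcr(synthetic, real)
nndr_values = nndr(synthetic, real)
print(dcr_values)
print(nndr_values)
```

Under these definitions, larger DCR values and NNDR values closer to 1 suggest that synthetic records do not sit unusually close to any single real record. The second synthetic record in the toy data copies a real record exactly, so both of its values are zero.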
