Deep generative models have become increasingly popular, particularly in natural language processing and computer vision, and recent efforts have extended these algorithms to tabular data. Although generative models have shown promising results in creating synthetic data, their high computational demands and need for careful parameter tuning present significant challenges. This study investigates whether combining refined synthetic datasets from multiple models can match or exceed the performance of a single, large generative model. To this end, we developed a Data-Centric Ensemble Synthetic Data model that draws on principles of ensemble learning. Our approach applies a data refinement process to several synthetic datasets: it systematically filters out noisy samples, then ranks, selects, and combines the refined datasets into an augmented, high-quality synthetic dataset, improving both the quantity and the quality of the data. Central to this process are the Ensemble k-Nearest Neighbors with Centroid Displacement (EKCD) algorithm, which we introduce for noise filtering, and a density score for ranking and selecting data. Our experiments confirmed the effectiveness of EKCD in removing noisy synthetic samples. In addition, the ensemble model built on the refined synthetic data substantially improved the performance of machine learning models, in some cases exceeding the performance obtained with real data.
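The refinement pipeline summarized above (filter, rank, select, combine) can be illustrated with a minimal sketch. The abstract does not give the details of EKCD or the density score, so everything below is an assumption made for illustration: the centroid-displacement rule (assign the class whose neighbour centroid moves least when the sample is added), the majority vote over several values of k, the inverse-distance density score, and all function names are hypothetical and are not the authors' implementation.

```python
import numpy as np

def centroid_displacement_label(x, X_ref, y_ref, k):
    """Among the k nearest reference rows, return the class whose neighbour
    centroid shifts least when x is added to it (assumed rule)."""
    dist = np.linalg.norm(X_ref - x, axis=1)
    idx = np.argsort(dist)[:k]
    best_label, best_shift = None, np.inf
    for label in np.unique(y_ref[idx]):
        members = X_ref[idx][y_ref[idx] == label]
        old = members.mean(axis=0)
        new = (members.sum(axis=0) + x) / (len(members) + 1)
        shift = np.linalg.norm(new - old)
        if shift < best_shift:
            best_label, best_shift = label, shift
    return best_label

def ensemble_filter(X_syn, y_syn, X_ref, y_ref, ks=(3, 5, 7)):
    """Keep a synthetic row when most k settings agree with its label
    (assumed ensemble scheme)."""
    keep = np.zeros(len(X_syn), dtype=bool)
    for i, (x, y) in enumerate(zip(X_syn, y_syn)):
        votes = sum(centroid_displacement_label(x, X_ref, y_ref, k) == y for k in ks)
        keep[i] = votes > len(ks) / 2
    return keep

def density_score(X_syn, X_ref, k=5):
    """Hypothetical density score in (0, 1]: higher when synthetic rows lie
    close to their k nearest reference rows."""
    d = np.linalg.norm(X_syn[:, None, :] - X_ref[None, :, :], axis=2)
    knn = np.sort(d, axis=1)[:, :k]
    return float(np.mean(1.0 / (1.0 + knn.mean(axis=1))))

def build_ensemble(synthetic_sets, X_ref, y_ref, top_n=2):
    """Filter each synthetic dataset, rank the survivors by density score,
    and stack the top_n refined datasets."""
    refined = []
    for X_syn, y_syn in synthetic_sets:
        keep = ensemble_filter(X_syn, y_syn, X_ref, y_ref)
        if keep.any():
            score = density_score(X_syn[keep], X_ref)
            refined.append((score, X_syn[keep], y_syn[keep]))
    if not refined:
        raise ValueError("no synthetic rows survived filtering")
    refined.sort(key=lambda item: item[0], reverse=True)
    chosen = refined[:top_n]
    X = np.vstack([item[1] for item in chosen])
    y = np.concatenate([item[2] for item in chosen])
    return X, y
```

Here `synthetic_sets` is a list of `(features, labels)` array pairs, one per generative model, and `X_ref`, `y_ref` are the real training data used as the reference for both filtering and scoring. The brute-force distance computation keeps the sketch dependency-free; a real pipeline would more likely use an indexed nearest-neighbour search.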