Abstract

Data generation has been one of the emerging trends in machine learning over the last decade. Despite the wide availability of data, small datasets remain a problem for decision-making, and synthetic data generation is a promising way to address it. However, previous methodologies address data generation for only one type of task: supervised or unsupervised. This research proposes a modified Mega-Trend Diffusion (MTD) approach, k-Nearest Neighbor Mega-Trend Diffusion (kNNMTD), to address these challenges. The method identifies the closest subsamples using the k-Nearest Neighbors (kNN) algorithm and applies MTD to the subsample neighbors to estimate the domain ranges. The proposed methodology can generate data for any data-driven task. kNNMTD is compared with baseline MTD, CTGAN, and the synthetic minority oversampling technique (SMOTE) on classification tasks, and with SMOTE for regression (SmoteR) on regression tasks. The proposed method is validated on benchmark datasets, simulated datasets, and a case study. Pairwise correlation difference (PCD) is used to measure the similarity between real and synthetic datasets. kNNMTD outperforms baseline MTD and CTGAN on all datasets, and the differences are statistically significant. On some of the benchmark datasets, kNNMTD also achieves low average PCD values and statistically significant differences against SMOTE and SmoteR. In the case study, kNNMTD generates data with the lowest PCD values among the compared methods for both the classification (1.2077) and ordinal regression (1.6017) tasks.
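
The abstract does not give the PCD formula. As an illustration only, the sketch below assumes a commonly used definition: the Frobenius norm of the difference between the Pearson correlation matrices of the real and synthetic data, where lower values indicate that the synthetic data better preserves the pairwise correlations. The function name and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def pairwise_correlation_difference(real, synthetic):
    """Assumed PCD definition: Frobenius norm of Corr(real) - Corr(synthetic).

    Both inputs are 2-D arrays with rows as samples and columns as features.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    corr_real = np.corrcoef(real, rowvar=False)
    corr_syn = np.corrcoef(synthetic, rowvar=False)
    return float(np.linalg.norm(corr_real - corr_syn, ord="fro"))


# Toy data (hypothetical): three features, the first two correlated.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
real[:, 1] += 0.8 * real[:, 0]

# A "good" synthetic set: the real data with small added noise.
noisy_copy = real + rng.normal(scale=0.05, size=real.shape)

# A "bad" synthetic set: each column shuffled independently,
# which keeps the marginals but destroys the correlations.
shuffled = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)

pcd_same = pairwise_correlation_difference(real, real)
pcd_close = pairwise_correlation_difference(real, noisy_copy)
pcd_shuffled = pairwise_correlation_difference(real, shuffled)

print(f"identical: {pcd_same:.4f}")
print(f"noisy copy: {pcd_close:.4f}")
print(f"shuffled columns: {pcd_shuffled:.4f}")
```

Under this assumed definition, a synthetic set that keeps each feature's marginal distribution but breaks the relationships between features scores a higher PCD than one that stays close to the real data.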

