From shallows to depths: unveiling hybrid synthetic data modeling for enhanced learning with privacy considerations in naturally imbalanced datasets

K A Bhat, S A Sofi

doi:10.1080/1206212x.2024.2409989

Abstract

Data is the foundation that drives model development and performance in machine learning and deep learning. Learning algorithms are only as good as the data on which they are trained. Data-constrained environments pose a serious threat to the effectiveness of learning algorithms. Data limitations stem from a range of factors, including regulatory restrictions, privacy concerns, and the inherent scarcity of relevant data. This constrained availability often leads to the class imbalance problem in classification tasks. To address the challenges of limited data, algorithmically generated data, also known as synthetic data, is gaining significant traction as a cost-effective, readily available, and secure alternative. Synthetic data generation techniques can be used to enlarge a dataset by augmenting data samples and to address class imbalance by increasing the number of minority-class instances. These techniques generally fall into two main categories: distance-based shallow models and probability-estimation-based deep generative models. Shallow interpolation-based models generate new data points within the local space between existing data points, while deep density-estimation-based models generate new data by learning the whole distribution of the data. On smaller datasets, deep generative models often struggle to accurately estimate the probability distribution of the whole data. To effectively represent the global data distribution, these deep models require initial data samples to guide their approximation. This paper examines the potential of integrating shallow and deep generative models in the data generation pipeline for effective synthetic data augmentation, which in turn improves learning and generalization in downstream tasks.
In this work, we present a hybrid approach to generating tabular data with mixed-type attributes (continuous and discrete), with particular attention to the problems of data imbalance and insufficient data. We introduce the Hybrid Data Balancing and Augmentation Approach for Mixed Tabular Data (HDBA-MTD), specifically designed to synthesize samples for underrepresented labels of the output class and to address the shortage of data instances. This approach increases training data diversity, with the aim of improving downstream classification and generalization performance. Experiments are carried out on benchmark datasets to assess the practicality of the presented hybrid model in real-world scenarios. This work also attempts to quantify how well the privacy of the real data is preserved, in view of ethical and data security considerations. The evaluation and analysis of these experiments show that the presented hybrid model performs favorably compared to other current hybrid synthetic data generation methods.
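
As an illustration of the shallow, interpolation-based category described in the abstract, the following minimal sketch generates synthetic minority-class rows on the line segment between an existing row and one of its nearest minority neighbours. It is not the HDBA-MTD method, whose details the abstract does not give. The function name `interpolate_minority`, its parameters, and the toy data are hypothetical, and the sketch assumes purely continuous features, so mixed-type attributes would need separate handling.

```python
import math
import random


def interpolate_minority(rows, n_new, k=3, seed=0):
    """Return n_new synthetic rows interpolated between minority rows.

    Hypothetical helper for illustration only; assumes continuous features.
    """
    if len(rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(rows) - 1)  # cannot have more neighbours than other rows
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(rows))
        base = rows[i]
        # Indices of the k nearest other minority rows (Euclidean distance).
        nearest = sorted(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(base, rows[j]),
        )[:k]
        neighbour = rows[rng.choice(nearest)]
        lam = rng.random()  # interpolation weight in [0, 1)
        # New point lies on the segment between base and neighbour.
        synthetic.append([b + lam * (nb - b) for b, nb in zip(base, neighbour)])
    return synthetic


# Toy minority class with two continuous features (made-up values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_rows = interpolate_minority(minority, n_new=5)
```

Because every synthetic row stays between two existing rows, a generator of this kind only fills in the local neighbourhood of the observed data, which is the contrast the abstract draws with deep models that try to learn the whole distribution.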

More From: International Journal of Computers and Applications

Similar Papers

ConvGeN: A convex space learning approach for deep-generative oversampling and imbalanced classification of small tabular datasets
Kristian Schultz ... Olaf Wolkenhauer
Pattern Recognition | VOL. 147
20 Nov 2023

Generation of Synthetic Tabular Healthcare Data Using Generative Adversarial Networks
Alireza Hossein Zadeh Nik ... Pål Halvorsen
01 Jan 2023

The application of deep generative models in urban form generation based on topology: a review
Bo Lin ... Simon Lannon
Architectural Science Review | VOL. 67
12 May 2023

Challenges and opportunities of generative models on tabular data
Alex X Wang ... Binh P Nguyen
Applied Soft Computing | VOL. 166
07 Sep 2024
