Data is the foundation of model development and performance in machine learning and deep learning. Learning algorithms are only as good as the data on which they are trained, and data-constrained environments pose a serious threat to their effectiveness. Data limitations stem from a range of factors, including regulatory restrictions, privacy concerns, and the inherent scarcity of relevant data. In classification tasks, this limited availability often leads to class imbalance. To address the challenges of limited data, algorithmically generated data, also known as synthetic data, is gaining significant traction as a cost-effective, readily available, and secure alternative. Synthetic data generation techniques can enlarge a dataset by augmenting data samples and can address class imbalance by increasing the number of minority-class instances. These techniques generally fall into two main categories: distance-based shallow models and deep generative models based on probability density estimation. Shallow interpolation-based models generate new data points in the local space between existing data points, while deep density-estimation-based models generate new data by learning the full data distribution. On smaller datasets, deep generative models often struggle to estimate the probability distribution of the whole dataset accurately. To represent the global data distribution effectively, these deep models require initial data samples to guide their approximation. This paper examines the potential of integrating shallow and deep generative models in the data generation pipeline for effective synthetic data augmentation, which in turn improves learning and generalization in downstream tasks.
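To make the idea of shallow, interpolation-based generation concrete, the following is a minimal SMOTE-style sketch. It is not the HDBA-MTD method. It assumes purely continuous features and Euclidean distance, and the function name `interpolate_minority` and its parameters are hypothetical names chosen for illustration. Each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours.

```python
import numpy as np


def interpolate_minority(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority samples.

    Hypothetical helper for illustration only; assumes continuous features.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # a point cannot have more than n - 1 neighbours

    # Pairwise Euclidean distances between all minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own neighbours

    # Indices of the k nearest neighbours of every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new row.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate: base + lam * (neighbour - base), with lam drawn from [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])


# Usage: four minority samples at the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = interpolate_minority(X_min, n_new=10)
```

Because every new point lies between two existing points, the output stays inside the local region already covered by the minority class. This is the limitation that motivates combining such models with a deep generative model that learns the full distribution. Discrete attributes would need separate handling, which this sketch does not attempt.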
In this work, we present a hybrid approach to generating tabular data with mixed-type attributes (continuous and discrete), with particular attention to the problems of data imbalance and insufficient data. We introduce the Hybrid Data Balancing and Augmentation Approach for Mixed Tabular Data (HDBA-MTD), specifically designed to synthesize samples for underrepresented class labels and to address the shortage of data instances. This approach enhances training data diversity, with the aim of improving downstream classification and generalization performance. Experiments are carried out on benchmark datasets to assess the practicality of the presented hybrid model in real-world scenarios. This work also attempts to quantify how well the privacy of the real data is preserved, in view of ethical and data security considerations. The evaluation and analysis of these experiments show that the presented hybrid model performs favorably compared to other current hybrid synthetic data generation methods.