Data Generation Approach Research Articles

Data is the foundation of model development and performance in machine learning and deep learning. Learning algorithms are only as good as the data on which they are trained. Data-constrained environments pose a serious threat to the effectiveness of learning algorithms. Data limitations stem from a range of factors, including regulatory restrictions, privacy concerns, and the inherent scarcity of relevant data. In classification tasks, this constrained availability of data often leads to the class imbalance problem. To address the challenges of limited data, algorithmically generated data, also known as synthetic data, is gaining significant traction as a cost-effective, readily available, and secure alternative. Synthetic data generation techniques can be employed to enlarge a dataset by augmenting data samples and to address class imbalance by increasing the number of minority class instances. These techniques generally fall into two main categories: distance-based shallow models and probability estimation-based deep generative models. Shallow interpolation-based models generate new data points within the local space between existing data points, while deep density estimation-based models generate new data by learning the whole distribution of the data. On smaller datasets, deep generative models often struggle to accurately estimate the probability distribution of the whole data. To effectively represent the global data distribution, these deep models require initial data samples to guide their approximation. This paper examines the potential of integrating shallow and deep generative models in the data generation pipeline for effective synthetic data augmentation, which in turn enhances learning and generalization in downstream tasks.
In this work, we present a hybrid approach to tabular data generation involving mixed-type data attributes (continuous, discrete) and pay special attention to the problems of data imbalance and insufficient data. We introduce the Hybrid Data Balancing and Augmentation Approach for Mixed Tabular Data (HDBA-MTD), specifically designed to synthesize samples for underrepresented labels of the output class and to address the issue of insufficient data instances. This approach enhances training data diversity, with particular attention to downstream classification and generalization performance. Experiments are carried out on benchmark datasets to assess the practicality of the presented hybrid model in real-world scenarios. This work also attempts to quantify privacy preservation for the real data, in view of ethical considerations and data security. The evaluation and analysis of these experiments show that the presented hybrid model performs favorably compared to other current hybrid synthetic data generation methods.
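The shallow, interpolation-based family mentioned in the abstract can be illustrated with a minimal SMOTE-style sketch. This is not the HDBA-MTD method, whose details the abstract does not give. The function name `interpolate_minority` and its parameters are hypothetical, and the sketch assumes purely continuous attributes (discrete attributes would need separate handling).

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: create n_new synthetic rows by linear interpolation
    between a minority-class row and one of its k nearest minority neighbours.
    Assumes continuous features only."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base row and a random one of its k neighbours.
    base = rng.integers(0, n, size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]

    # New point lies on the segment between the base row and the neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[nb] - X_min[base])
```

Because every synthetic row lies on a segment between two existing rows, the output stays inside the local region spanned by the minority samples, which is the locality property the abstract contrasts with deep models that learn the whole distribution.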


The development of Network Intrusion Detection Systems (NIDS) requires labeled network traffic, especially to train and evaluate machine learning approaches. Besides recording traffic, generating traffic via generative models is a promising approach to obtaining vast amounts of labeled data. Various machine learning approaches for data generation exist, but assessing data quality is complex and not standardized. The lack of common quality criteria complicates the comparison of synthetic data generation approaches and of synthetic data. Our work addresses this gap in multiple steps. First, we review and categorize existing approaches for evaluating synthetic data in the network traffic domain and in other data domains. Second, based on our review, we compile a set of metrics suitable for the NetFlow domain, which we aggregate into two metrics: the Data Dissimilarity Score and the Domain Dissimilarity Score. Third, we evaluate the proposed metrics on real-world data sets to demonstrate their ability to distinguish between samples from different data sets. As a final step, we conduct a case study to demonstrate the application of the metrics to the evaluation of synthetic data. We calculate the metrics on samples from real NetFlow data sets to define an upper and a lower bound for inter- and intra-data set similarity scores. Afterward, we generate synthetic data via a Generative Adversarial Network (GAN) and Generative Pre-trained Transformer 2 (GPT-2), apply the metrics to these synthetic data, and incorporate the lower-bound baseline results to obtain an objective benchmark. The application of the benchmarking process is demonstrated on three NetFlow benchmark data sets: NF-CSE-CIC-IDS2018, NF-ToN-IoT and NF-UNSW-NB15.
Our demonstration indicates that this benchmark framework captures the differences in similarity between real-world data and synthetic data of varying quality well, and can therefore be used to assess the quality of generated synthetic data.
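The baseline idea described above (score real-versus-real samples first, then place synthetic data relative to those bounds) can be sketched as follows. The abstract does not define the Data Dissimilarity Score or the Domain Dissimilarity Score, so the score below is a stand-in chosen for illustration: the mean per-feature Jensen-Shannon divergence over shared histogram bins. The names `js_divergence` and `dissimilarity` and the `bins` parameter are assumptions, and only numeric features are handled.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range 0..1) of two count vectors."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def dissimilarity(A, B, bins=20):
    """Stand-in score (an assumption, not the paper's metric): mean
    per-feature JS divergence between two numeric samples A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    scores = []
    for j in range(A.shape[1]):
        # Shared bin edges so both histograms are directly comparable.
        edges = np.histogram_bin_edges(
            np.concatenate([A[:, j], B[:, j]]), bins=bins
        )
        p = np.histogram(A[:, j], bins=edges)[0].astype(float)
        q = np.histogram(B[:, j], bins=edges)[0].astype(float)
        scores.append(js_divergence(p, q))
    return float(np.mean(scores))
```

Under these assumptions, scoring two halves of one real data set gives an intra-data set lower bound, scoring two different real data sets gives an inter-data set upper bound, and a synthetic sample scored against the real data can then be read relative to the two.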


Articles published on Data Generation Approach

From shallows to depths: unveiling hybrid synthetic data modeling for enhanced learning with privacy considerations in naturally imbalanced datasets

Generating 3D Models for UAV-Based Detection of Riparian PET Plastic Bottle Waste: Integrating Local Social Media and InstantMesh

Mitigating adversarial cascades in large graph environments

Improving Machine Learned Force Fields for Complex Fluids through Enhanced Sampling: A Liquid Crystal Case Study.

AR-ADASYN: angle radius-adaptive synthetic data generation approach for imbalanced learning

A systematic approach for data generation for intelligent fault detection and diagnosis in District Heating

Synthetic datasets for open software development in rare disease research

Benchmarking of synthetic network data: Reviewing challenges and approaches

Relation Extraction in Underexplored Biomedical Domains: A Diversity-optimised Sampling and Synthetic Data Generation Approach

How Preceptors Support Pharmacy Learner Professional Identity Formation

Guided Docking as a Data Generation Approach Facilitates Structure-Based Machine Learning on Kinases.

Addressing Data Scarcity in the Medical Domain: A GPT-Based Approach for Synthetic Data Generation and Feature Extraction

Inverse physics–informed neural networks for digital twin–based bearing fault diagnosis under imbalanced samples

Synthetic Data Improve Survival Status Prediction Models in Early-Onset Colorectal Cancer.

Synthetic Data and Hierarchical Object Detection in Overhead Imagery

PVS-GEN: Systematic Approach for Universal Synthetic Data Generation Involving Parameterization, Verification, and Segmentation.

Channel Attention GAN-Based Synthetic Weed Generation for Precise Weed Identification.

Unlocking biomedical data sharing: A structured approach with digital twins and artificial intelligence (AI) for open health sciences.

KI-MAG: A knowledge-infused abstractive question answering system in medical domain

GAMMA: Graph Attention Model for Multiple Agents to Solve Team Orienteering Problem With Multiple Depots.
