Generation of Synthetic Continuous Numerical Data Using Generative Adversarial Networks

Abstract

Continuous numerical data is often used for unsupervised learning tasks such as clustering. However, this valuable data is often available only in small amounts because it is hard to obtain, expensive, requires an expert to collect, or contains confidential information that cannot be published. Such limited data can be an obstacle to processing and analyzing data and can restrain clustering-related research in general. Therefore, an alternative is needed that can replace or increase the amount of data. The proposed method generates synthetic continuous numerical data using Generative Adversarial Networks (GANs). This study used two GAN architectures (GAN and CGAN) and focused on unlabeled continuous numerical data to provide replacement or additional data for the clustering task. The quality of the synthetic data was measured by the accuracy of the xgboost algorithm in classifying real versus synthetic data. Perfectly realistic synthetic data would yield an xgboost accuracy of 50% (chance level); synthetic data generated by CGAN achieved 63%. The results of this study show that GANs can generate data that is similar enough to, and not significantly different from, the real data.
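
The evaluation idea in the abstract (train a classifier to separate real rows from synthetic rows, and read an accuracy near 50% as "indistinguishable") can be sketched as follows. This is a minimal illustration, not the authors' code: to stay dependency-free it swaps xgboost for a leave-one-out 1-nearest-neighbour classifier, and the toy Gaussian data and all names (`discriminator_accuracy`, `sample`, `good_synthetic`, `bad_synthetic`) are assumptions made for this example.

```python
import math
import random


def discriminator_accuracy(real, synthetic):
    """Leave-one-out 1-NN accuracy at telling real rows (label 0)
    from synthetic rows (label 1). Values near 0.5 mean the two
    sets are hard to tell apart; values near 1.0 mean they are not."""
    rows = [(r, 0) for r in real] + [(s, 1) for s in synthetic]
    correct = 0
    for i, (x, label) in enumerate(rows):
        best_dist, best_label = math.inf, None
        for j, (y, other_label) in enumerate(rows):
            if i == j:
                continue  # hold out the row being classified
            d = math.dist(x, y)
            if d < best_dist:
                best_dist, best_label = d, other_label
        correct += int(best_label == label)
    return correct / len(rows)


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def sample(mean, n):
    """Draw n two-dimensional points from an isotropic unit Gaussian."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


real = sample(0.0, 200)
good_synthetic = sample(0.0, 200)  # same distribution as the real data
bad_synthetic = sample(5.0, 200)   # clearly shifted distribution

acc_good = discriminator_accuracy(real, good_synthetic)
acc_bad = discriminator_accuracy(real, bad_synthetic)
print(f"same distribution:    {acc_good:.2f}")
print(f"shifted distribution: {acc_bad:.2f}")
```

On this scale, the 63% reported for CGAN sits much closer to the chance-level end than to the fully separable end. A gradient-boosted classifier such as xgboost with a held-out test split would play the same role as the 1-NN stand-in used here.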

Similar Papers
  • Research Article
  • Cite Count Icon 9
  • 10.52756/ijerr.2023.v30.004
GLSTM: A novel approach for prediction of real & synthetic PID diabetes data using GANs and LSTM classification model
  • Apr 30, 2023
  • International Journal of Experimental Research and Review
  • Sushma Jaiswal + 1 more

Generative Adversarial Network (GAN) is a revolution in modern artificial systems. Deep learning-based Generative adversarial networks generate realistic synthetic tabular data. Synthetic data are used to enhance the size of a relatively small training dataset while ensuring the confidentiality of the original data. In this context, we implemented the GAN framework for generating diabetes data to help health care professionals in more clinical applications. GAN is used to validate the Pima Indian Diabetes (PID) Dataset. Various preprocessing techniques, such as handling missing values, outliers and data imbalance problems, enhance data quality. Some exploratory data analyses, such as heat maps, bar graphs and histograms, are used for data visualisation. We employed hypothesis testing to examine the resemblance between real data and GAN-generated synthetic data. In this study, we proposed a GAN-Long Short-Term Memory (GLSTM) system, in which GAN is used for data augmentation, and LSTM is used for diabetes classification. Additionally, various GAN models such as CTGAN, Vanilla GAN, Copula GAN, Gaussian Copula GAN, and TVAE GAN are used to generate the synthetic dataset. Experiments were conducted on real data, synthetic data, and by combining real and synthetic data. The model that used both real and synthetic data obtained a substantially better accuracy of 97% compared to 92% when only real data was used. We also observed that synthetic data could be used in place of real data, as the mean correlation between synthetic and real data is 0.93. Our study's findings outperformed state-of-the-art methodologies.

  • Abstract
  • Cite Count Icon 2
  • 10.1182/blood-2022-168646
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
  • Nov 15, 2022
  • Blood
  • Saverio D'Amico + 19 more

  • Abstract
  • Cite Count Icon 1
  • 10.1182/blood-2022-171057
Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia
  • Nov 15, 2022
  • Blood
  • Dimitris Karletsos + 2 more

  • Research Article
  • Cite Count Icon 21
  • 10.1109/jbhi.2023.3236722
Characterization of Synthetic Health Data Using Rule-Based Artificial Intelligence Models.
  • Aug 1, 2023
  • IEEE Journal of Biomedical and Health Informatics
  • Marta Lenatti + 4 more

The aim of this study is to apply and characterize eXplainable AI (XAI) to assess the quality of synthetic health data generated using a data augmentation algorithm. In this exploratory study, several synthetic datasets are generated using various configurations of a conditional Generative Adversarial Network (GAN) from a set of 156 observations related to adult hearing screening. A rule-based native XAI algorithm, the Logic Learning Machine, is used in combination with conventional utility metrics. The classification performance in different conditions is assessed: models trained and tested on synthetic data, models trained on synthetic data and tested on real data, and models trained on real data and tested on synthetic data. The rules extracted from real and synthetic data are then compared using a rule similarity metric. The results indicate that XAI may be used to assess the quality of synthetic data by (i) the analysis of classification performance and (ii) the analysis of the rules extracted on real and synthetic data (number, covering, structure, cut-off values, and similarity). These results suggest that XAI can be used in an original way to assess synthetic health data and extract knowledge about the mechanisms underlying the generated data.

  • Preprint Article
  • 10.32920/26052700.v1
Novel Generative Adversarial Network Architectures for Generating image Data
  • Jun 19, 2024
  • Sanaz Mohammad Jafari

High data collection costs and complicated data access regulations increase the demand for synthetic data. Generative Adversarial Networks (GANs) are a novel generative framework with great potential for high quality synthetic data generation. GANs formulate the true distribution of data implicitly, and the success of GANs is often measured based on the similarity of generated data to this true distribution. GANs were originally designed to work with continuous data. However, many important real-world datasets such as medical images involve discontinuous distributions. GAN training for discontinuous distributions is relatively more challenging, as the training procedure often suffers from instability and mode collapse issues. This dissertation focuses on designing novel GAN architectures to generate representative synthetic image data, and proposes new structures to alleviate GANs' mode collapse issue. As part of this thesis, novel applications of image data generation with GANs have also been investigated for important problems arising in the telecommunication industry and medical domain. Specifically, we first explore various GAN structures to generate engineered electromagnetic surfaces. We consider the continuous approximation of the data and explore the capabilities of feed-forward and convolutional GANs for synthetic data generation. Next, we introduce a novel GAN architecture to address the problem of mode collapse in GAN training. The proposed structure incorporates a third network that penalizes the generator for generating low diversity samples. Lastly, we study the challenging problem of object generation in 3D space using GANs, and we propose extensions to existing 3D GAN structures to generate connected 3D volumes. Additionally, we explore a more challenging version of this 3D volume generation problem by generating connected volumes packed with spheres.
This research has applications in radiosurgery treatment planning, and the proposed 3D GAN structure can help generate rare, unseen 3D tumor volumes and information on how to treat these tumors. Accordingly, our analysis contributes to overcoming data scarcity issues (e.g., due to privacy considerations) for an important practical problem in the medical domain.

  • Conference Article
  • Cite Count Icon 3
  • 10.1109/iccc56324.2022.10065986
Application of Generative Adversarial Network Tabular Data Synthesis for Federal Learning-based Thermal Process Performance Prediction
  • Dec 9, 2022
  • Lewei Xu + 1 more

Machine learning has given process performance prediction a fresh and efficient method. Existing techniques do not provide good data protection capabilities. The novelty of this work is that it proposes and validates the use of virtual synthetic thermal processing process performance data as input to machine learning models, where the ‘train on synthetic data - test on real data’ approach is used to pioneer a novel framework for predicting thermal processing process performance. First, the data generated by the table generation adversarial network is applied to the federal learning model for performance prediction. Based on the input-output relationship curve, an evaluation index is proposed for the generation of data for thermal processing performance prediction. Finally, the effect of the generated sample size on the prediction of the machine learning model is investigated. The model is trained using 10,00 synthetic design data and tested using 915 real experimental data. The results show that the synthetic data contribute to the good performance prediction capability of the machine learning model. The use of this method will help to extend the application of federal learning based thermal processing process performance.

  • Research Article
  • Cite Count Icon 39
  • 10.1016/j.eswa.2022.117936
Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems
  • Jun 27, 2022
  • Expert Systems with Applications
  • Marc Chalé + 1 more

  • Research Article
  • Cite Count Icon 10
  • 10.3390/app11062787
Generating Synthetic Fermentation Data of Shindari, a Traditional Jeju Beverage, Using Multiple Imputation Ensemble and Generative Adversarial Networks
  • Mar 20, 2021
  • Applied Sciences
  • Debapriya Hazra + 1 more

Fermentation is an age-old technique used to preserve food by restoring proper microbial balance. Boiled barley and nuruk are fermented for a short period to produce Shindari, a traditional beverage for the people of Jeju, South Korea. Shindari has been proven to be a drink of multiple health benefits if fermented for an optimal period. It is necessary to predict the ideal fermentation time required by each microbial community to keep the advantages of the microorganisms produced by the fermentation process in Shindari intact and to eliminate contamination. Prediction through machine learning requires past data but the process of obtaining fermentation data of Shindari is time consuming, expensive, and not easily available. Therefore, there is a need to generate synthetic fermentation data to explore various benefits of the drink and to reduce any risk from overfermentation. In this paper, we propose a model that takes incomplete tabular fermentation data of Shindari as input and uses multiple imputation ensemble (MIE) and generative adversarial networks (GAN) to generate synthetic fermentation data that can be later used for prediction and microbial spoilage control. For multiple imputation, we used multivariate imputation by chained equations and random forest imputation, and ensembling was done using the bagging and stacking method. For generating synthetic data, we remodeled the tabular GAN with skip connections and adapted the architecture of Wasserstein GAN with gradient penalty. We compared the performance of our model with other imputation and ensemble models using various evaluation metrics and visual representations. Our GAN model could overcome the mode collapse problem and converged at a faster rate than existing GAN models for synthetic data generation. Experiment results show that our proposed model executes with less error, is more accurate, and generates significantly better synthetic fermentation data compared to other models.

  • Research Article
  • 10.1142/s0218213025500101
Leveraging Multi-Modal Generative Adversarial Networks (GANs) for Synthetic Crime Data Simulation
  • May 1, 2025
  • International Journal on Artificial Intelligence Tools
  • Tianyu Fan + 2 more

Crime analysis and predictive modeling in criminology face challenges due to data scarcity and privacy limitations. This happens due to a lack of comprehensive datasets that capture the complex aspects of crime. This research aims to develop a Multi-modal Generative Adversarial Network (MM-GAN) framework to accurately model complicated synthetic crime data. The goal of MM-GAN, which integrates several GAN architectures, is to provide varied and high-fidelity data depicting many crime elements in a unified model. These features include geographical distribution, temporal patterns, and category information. MM-GAN uses Conditional GANs (cGANs) to regulate the kinds of data produced according to crime characteristics like time, place, and kind. To further enhance the usability of the produced data for model training and guarantee that it properly represents real-world classifications, Auxiliary Classifier GANs (AC-GANs) are included to classify synthetic data. As a data format bridge, CycleGAN allows MM-GAN to represent various sources by facilitating cross-domain conversions between structured and unstructured. The model’s capacity to simulate seasonality and trends is enhanced by adding temporal GAN layers, which allow the model to capture sequential crime patterns. A strong resource for predictive analysis, risk assessment, and simulation training, MM-GAN-generated synthetic data closely matches real crime patterns, according to experiments. This framework provides a user-friendly and privacy-protecting tool for creating enhanced datasets, which is useful for criminology academics and law enforcement authorities. MM-GAN provides a scalable solution via synthetic data simulation to empower secure environments with Artificial Intelligence (AI)-driven insights and models.

  • Research Article
  • Cite Count Icon 4
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.

  • Research Article
  • Cite Count Icon 14
  • 10.1371/journal.pone.0260308
Generative adversarial networks for generating synthetic features for Wi-Fi signal quality.
  • Nov 23, 2021
  • PLOS ONE
  • Mauro Castelli + 4 more

Wireless networks are among the fundamental technologies used to connect people. Considering the constant advancements in the field, telecommunication operators must guarantee a high-quality service to keep their customer portfolio. To ensure this high-quality service, it is common to establish partnerships with specialized technology companies that deliver software services in order to monitor the networks and identify faults and respective solutions. A common barrier faced by these specialized companies is the lack of data to develop and test their products. This paper investigates the use of generative adversarial networks (GANs), which are state-of-the-art generative models, for generating synthetic telecommunication data related to Wi-Fi signal quality. We developed, trained, and compared two of the most used GAN architectures: the Vanilla GAN and the Wasserstein GAN (WGAN). Both models presented satisfactory results and were able to generate synthetic data similar to the real ones. In particular, the distribution of the synthetic data overlaps the distribution of the real data for all of the considered features. Moreover, the considered generative models can reproduce the same associations observed for the synthetic features. We chose the WGAN as the final model, but both models are suitable for addressing the problem at hand.

  • Research Article
  • Cite Count Icon 33
  • 10.1109/tai.2022.3229289
A Universal Metric for Robust Evaluation of Synthetic Tabular Data
  • Jan 1, 2024
  • IEEE Transactions on Artificial Intelligence
  • Vikram S Chundawat + 4 more

Synthetic tabular data generation becomes crucial when real data is limited, expensive to collect, or simply cannot be used due to privacy concerns. However, producing good quality synthetic data is challenging. Several probabilistic, statistical, generative adversarial networks (GANs), and variational auto-encoder (VAEs) based approaches have been presented for synthetic tabular data generation. Once generated, evaluating the quality of the synthetic data is quite challenging. Some of the traditional metrics have been used in the literature but there is a lack of a common, robust, and single metric. This makes it difficult to properly compare the effectiveness of different synthetic tabular data generation methods. In this paper we propose a new universal metric, TabSynDex, for robust evaluation of synthetic data. The proposed metric assesses the similarity of synthetic data with real data through different component scores which evaluate the characteristics that are desirable for “high quality” synthetic data. Being a single score metric and having an implicit bound, TabSynDex can also be used to observe and evaluate the training of neural network based approaches. This would help in obtaining insights that were not possible earlier. We present several baseline models for comparative analysis of the proposed evaluation metric with existing generative models. We also give a comparative analysis between TabSynDex and existing synthetic tabular data evaluation metrics. This shows the effectiveness and universality of our metric over the existing metrics. Source Code: https://github.com/vikram2000b/tabsyndex

  • Research Article
  • Cite Count Icon 2
  • 10.3233/shti240490
On the Fidelity-Privacy Tradeoff of Synthetic Cancer Registry Data.
  • Aug 22, 2024
  • Studies in health technology and informatics
  • Philipp Röchner

The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data, because ideally it should be impossible to reconstruct real personal data from synthetic data, which is called privacy. At the same time, the structure of the synthetic data should be as similar as possible to the structure of the real data to ensure that conclusions drawn from the synthetic data are also valid for the real data, which is called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied have nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.

  • Research Article
  • Cite Count Icon 111
  • 10.1371/journal.pone.0267976
SinGAN-Seg: Synthetic training data generation for medical image segmentation.
  • May 2, 2022
  • PLOS ONE
  • Vajira Thambawita + 8 more

Analyzing medical data to find abnormalities is a time-consuming and costly task, particularly for rare abnormalities, requiring tremendous efforts from medical experts. Therefore, artificial intelligence has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. However, the machine learning models used to build these tools are highly dependent on the data used to train them. Large amounts of data can be difficult to obtain in medicine due to privacy reasons, expensive and time-consuming annotations, and a general lack of data samples for infrequent lesions. In this study, we present a novel synthetic data generation pipeline, called SinGAN-Seg, to produce synthetic medical images with corresponding masks using a single training image. Our method is different from the traditional generative adversarial networks (GANs) because our model needs only a single image and the corresponding ground truth to train. We also show that the synthetic data generation pipeline can be used to produce alternative artificial segmentation datasets with corresponding ground truth masks when real datasets cannot be shared. The pipeline is evaluated using qualitative and quantitative comparisons between real data and synthetic data to show that the style transfer technique used in our pipeline significantly improves the quality of the generated data and that our method is better than other state-of-the-art GANs at preparing synthetic images when the size of training datasets is limited. By training UNet++ using both real data and the synthetic data generated from the SinGAN-Seg pipeline, we show that the models trained on synthetic data have very close performances to those trained on real data when both datasets have a considerable amount of training data.
In contrast, we show that synthetic data generated from the SinGAN-Seg pipeline improves the performance of segmentation models when training datasets do not have a considerable amount of data. All experiments were performed using an open dataset and the code is publicly available on GitHub.
