Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia

Similar Papers
  • Research Article
  • 10.53894/ijirss.v8i5.8655
Generative adversarial networks for synthetic data generation: A systematic review of techniques, applications, and evaluation methods
  • Jul 18, 2025
  • International Journal of Innovative Research and Scientific Studies
  • Rajermani Thinakaran + 4 more

Generative adversarial networks (GANs), which have emerged as one of the powerful frameworks for generating synthetic data, have proven remarkably capable across domains. This systematic review explores the rapidly evolving GAN landscape, particularly their applications for generating high-fidelity synthetic data that resemble real-world datasets' statistical properties. We comprehensively analyze recent literature to present the following key findings:
1. GANs' Capabilities: GANs have demonstrated significant potential across various fields, especially in creating synthetic data that mimic real-world datasets.
2. State-of-the-Art Architectures: Advanced GAN variants, such as Conditional GANs, Wasserstein GANs, and Cycle GANs, have shown great promise for transformation in sectors like healthcare, finance, and image processing.
3. Evaluation Methodologies: Metrics for assessing GAN-generated data include statistical similarity, downstream task performance, and privacy preservation, highlighting strengths and limitations in current evaluation paradigms.
4. Training Difficulties: GANs face challenges such as mode collapse, instability, and sensitivity to hyperparameters, which require further innovation and exploration.
Additionally, we critically examine the methodologies used to evaluate the quality and utility of GAN-generated data. Metrics like statistical similarity, downstream task performance, and privacy preservation provide a broad view of current strengths and limitations. Besides synthetic data generation using GAN-based methods, this review discusses training difficulties and emerging directions aimed at mitigating issues like mode collapse, instability, and hyperparameter sensitivity. The findings emphasize significant progress in GAN-based synthetic data generation but underline the need for a robust, standardized evaluation framework and continued innovation in model architectures.
1. Robust Evaluation Framework: Developing a standardized evaluation framework for GAN-generated data is essential for advancing the field.
2. Model Architecture Innovation: Ongoing innovation in model architectures is necessary to overcome current limitations and enhance GAN performance.
3. Synthetic Data Generation: GANs hold great potential for generating synthetic data, which can address data privacy concerns, data scarcity, and data augmentation needs.
This review aims to help researchers and practitioners understand the current state and future directions of GAN applications in synthetic data generation.

  • Research Article
  • Citations: 18
  • 10.2196/47859
Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy.
  • Nov 24, 2023
  • JMIR Medical Informatics
  • Ha Ye Jin Kang + 5 more

Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information. This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships. The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. The synthetic data of the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). 
The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
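The three-step divide-and-conquer strategy described in this abstract (partition, generate per subset, consolidate) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, only the class-specific partitioning criterion is shown, and a simple per-column Gaussian sampler stands in for the conditional tabular GAN and copula GAN used in the study.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def partition_by_class(rows, label_key):
    """Step 1: split the data set into one subset per class label."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[label_key]].append(row)
    return dict(subsets)


def fit_and_sample(subset, label_key, n, rng):
    """Step 2 (stand-in): sample each numeric column from an independent
    Gaussian fitted to the subset. This is NOT a GAN; it only marks the
    place where a tabular generator would be trained on the subset."""
    columns = [c for c in subset[0] if c != label_key]
    stats = {
        c: (mean(r[c] for r in subset), pstdev(r[c] for r in subset))
        for c in columns
    }
    label = subset[0][label_key]
    samples = []
    for _ in range(n):
        row = {c: rng.gauss(mu, sigma) for c, (mu, sigma) in stats.items()}
        row[label_key] = label  # every record keeps the label of its subset
        samples.append(row)
    return samples


def divide_and_conquer_sdg(rows, label_key, n_per_class, seed=0):
    """Step 3: consolidate the per-subset synthetic records into one data set.
    Drawing the same number of records per class yields balanced output."""
    rng = random.Random(seed)
    synthetic = []
    for subset in partition_by_class(rows, label_key).values():
        synthetic.extend(fit_and_sample(subset, label_key, n_per_class, rng))
    rng.shuffle(synthetic)
    return synthetic
```

Because each subset is modeled separately, a synthetic record can never mix the label of one group with feature statistics fitted on another, which is one plausible reading of how partitioning helps preserve logical relationships.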

  • Abstract
  • 10.1182/blood-2018-99-120102
A Naturally Occurring Canine Model of Chronic Lymphocytic Leukemia/Small Lymphocytic Lymphoma: IGHV Mutation Status, Gene Expression, and Clinical Outcome
  • Nov 29, 2018
  • Blood
  • Emily Rout + 4 more

  • Research Article
  • Citations: 8
  • 10.3390/app142310818
Boosting EEG and ECG Classification with Synthetic Biophysical Data Generated via Generative Adversarial Networks
  • Nov 22, 2024
  • Applied Sciences
  • Archana Venugopal + 1 more

This study presents a novel approach using Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) to generate synthetic electroencephalography (EEG) and electrocardiogram (ECG) waveforms. The synthetic EEG data represent concentration and relaxation mental states, while the synthetic ECG data correspond to normal and abnormal states. By addressing the challenges of limited biophysical data, including privacy concerns and restricted volunteer availability, our model generates realistic synthetic waveforms learned from real data. Combining real and synthetic datasets improved classification accuracy from 92% to 98.45%, highlighting the benefits of dataset augmentation for machine learning performance. The WGAN-GP model achieved 96.84% classification accuracy for synthetic EEG data representing relaxation states and optimal accuracy for concentration states when classified using a fusion of convolutional neural networks (CNNs). A 50% combination of synthetic and real EEG data yielded the highest accuracy of 98.48%. For EEG signals, the real dataset consisted of 60-s recordings across four channels (TP9, AF7, AF8, and TP10) from four individuals, providing approximately 15,000 data points per subject per state. For ECG signals, the dataset contained 1200 real samples, each comprising 140 data points, representing normal and abnormal states. WGAN-GP outperformed a basic generative adversarial network (GAN) in generating reliable synthetic data. For ECG data, a support vector machine (SVM) classifier achieved an accuracy of 98% with real data and 95.8% with synthetic data. Synthetic ECG data improved the random forest (RF) classifier’s accuracy from 97% with real data alone to 98.40% when combined with synthetic data. Statistical significance was assessed using the Wilcoxon signed-rank test, demonstrating the robustness of the WGAN-GP model. 
Techniques such as discrete wavelet transform, downsampling, and upsampling were employed to enhance data quality. This method shows significant potential in addressing biophysical data scarcity and advancing applications in assistive technologies, human-robot interaction, and mental health monitoring, among other medical applications.

  • Abstract
  • Citations: 2
  • 10.1182/blood-2022-168646
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
  • Nov 15, 2022
  • Blood
  • Saverio D'Amico + 19 more

  • Research Article
  • Citations: 9
  • 10.52756/ijerr.2023.v30.004
GLSTM: A novel approach for prediction of real & synthetic PID diabetes data using GANs and LSTM classification model
  • Apr 30, 2023
  • International Journal of Experimental Research and Review
  • Sushma Jaiswal + 1 more

The Generative Adversarial Network (GAN) is a revolution in modern artificial systems. Deep learning-based GANs generate realistic synthetic tabular data. Synthetic data are used to enlarge a relatively small training dataset while ensuring the confidentiality of the original data. In this context, we implemented the GAN framework for generating diabetes data to help health care professionals in clinical applications. GAN is used to validate the Pima Indian Diabetes (PID) Dataset. Various preprocessing techniques, such as handling missing values, outliers and data imbalance problems, enhance data quality. Some exploratory data analyses, such as heat maps, bar graphs and histograms, are used for data visualisation. We employed hypothesis testing to examine the resemblance between real data and GAN-generated synthetic data. In this study, we proposed a GAN-Long Short-Term Memory (GLSTM) system, in which GAN is used for data augmentation, and LSTM is used for diabetes classification. Additionally, various GAN models such as CTGAN, Vanilla GAN, Copula GAN, Gaussian Copula GAN, and TVAE GAN are used to generate the synthetic dataset. Experiments were conducted on real data, synthetic data, and a combination of real and synthetic data. The model that used both real and synthetic data obtained a substantially better accuracy of 97% compared to 92% when only real data was used. We also observed that synthetic data could be used in place of real data, as the mean correlation between synthetic and real data is 0.93. Our study's findings outperformed state-of-the-art methodologies.

  • Research Article
  • Citations: 4
  • 10.3171/2025.4.focus25225
Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
  • Jul 1, 2025
  • Neurosurgical Focus
  • Austin A Barr + 3 more

Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.

  • Abstract
  • 10.1182/blood.v122.21.4142.4142
Concomitant, T-Independent TLR9-Mediated and BCR-Mediated Activation Provides Signals For Optimal Telomerase Induction In Chronic Lymphocytic Leukemia Cells Regardless Of IGHV Mutation Status
  • Nov 15, 2013
  • Blood
  • Rajendra N Damle + 7 more

  • Abstract
  • 10.1182/blood.v120.21.2545.2545
Ultra-Deep Sequencing of De Novo IGHV Mutations in Activated CLL Cells: Evidence for Activation-Induced Deaminase Function.
  • Nov 16, 2012
  • Blood
  • Piers E.M Patten + 8 more

  • Research Article
  • Citations: 25
  • 10.59247/csol.v3i1.170
Understanding Generative Adversarial Networks (GANs): A Review
  • Feb 7, 2025
  • Control Systems and Optimization Letters
  • Purwono Purwono + 3 more

Generative Adversarial Networks (GANs) are an important breakthrough in artificial intelligence that use two neural networks, a generator and a discriminator, working in an adversarial framework. The generator generates synthetic data, while the discriminator evaluates the authenticity of the data. This dynamic interaction forms a minimax game that produces high-quality synthetic data. Since their introduction in 2014 by Ian Goodfellow, GANs have evolved through various innovative architectures, including Vanilla GAN, Conditional GAN (cGAN), Deep Convolutional GAN (DCGAN), CycleGAN, StyleGAN, Wasserstein GAN (WGAN), and BigGAN. Each of these architectures presents a novel approach to address technical challenges such as training stability, data diversification, and result quality. GANs have been widely applied in various sectors. In healthcare, GANs are used to generate synthetic medical images that support diagnostic development without violating patient privacy. In the media and entertainment industry, GANs facilitate the enhancement of image and video resolution, as well as the creation of realistic content. However, the development of GANs faces challenges such as mode collapse, training instability, and inadequate quality evaluation. In addition to technical challenges, GANs raise ethical issues, such as the misuse of the technology for deepfake creation. Legal regulations, detection tools, and public education are important mitigation measures. Future trends suggest that GANs will be increasingly used in text-to-image synthesis, realistic video generation, and integration with multimodal systems to support cross-disciplinary innovation.
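For reference, the minimax game mentioned in this abstract is commonly written as the value function below, following the original 2014 formulation. The symbols are not taken from the abstract and are assumptions of this note: G is the generator, D the discriminator, p_data the real-data distribution, and p_z the noise prior.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximize V by scoring real samples near 1 and generated samples near 0, while the generator tries to minimize it by making D(G(z)) approach 1.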

  • Research Article
  • Citations: 9
  • 10.1111/apm.12692
IFI16 reduced expression is correlated with unfavorable outcome in chronic lymphocytic leukemia.
  • May 18, 2017
  • APMIS
  • Pier Paolo Piccaluga + 15 more

Chronic lymphocytic leukemia (CLL) is the most common leukemia in adults. Its clinical course is typically indolent; however, based on a series of pathobiological, clinical, genetic, and phenotypic parameters, patient survival varies from less than 5 to more than 20 years. In this paper, we show for the first time that the expression of the interferon-inducible DNA sensor IFI16, a member of the PYHIN protein family involved in proliferation inhibition and apoptosis regulation, is associated with the clinical outcome in CLL. We studied 99 CLL cases by immunohistochemistry and 10 CLL cases by gene expression profiling. We found quite variable degrees of IFI16 expression among CLL cases. Notably, we observed that reduced IFI16 expression was associated with very poor survival, but only in cases with ZAP70/CD38 expression. Furthermore, we found that IFI16 expression was associated with a specific gene expression signature. As IFI16 can be easily detected by immunohistochemistry or flow cytometry, it may become a part of phenotypic screening in CLL patients if its prognostic role is confirmed in independent series.

  • Research Article
  • 10.1088/1742-6596/3140/5/052007
On the utility of synthetic data for building energy research
  • Nov 1, 2025
  • Journal of Physics: Conference Series
  • A Tell + 3 more

Measurements from building energy management systems (BEMS) are critical for deploying data-driven operational solutions. However, time-series data can inadvertently expose occupant preferences and daily routines, raising privacy concerns. Synthetic data generation has emerged as a promising method to address these issues, with generative adversarial networks (GANs) and other generative models showing particular efficacy in replicating the characteristics of BEMS data. This study evaluates synthetic BEMS data generated by conditional GANs using actual measurements from the UMAR residential unit at the NEST demonstrator in Switzerland. The unit has also extensively hosted experimental data-driven solutions that prove operational energy-saving capabilities. The GAN model predicts electricity, heating, cooling energy consumption, and indoor air temperatures for three rooms, conditioned on weather data (dry-bulb temperature, relative humidity, solar radiation) and operational states (valve status, occupant presence). We assess the synthetic data’s utility by comparing GAN projections against simulations from a calibrated high-fidelity EnergyPlus model. Furthermore, we evaluate their practical utility by applying synthetic data to a data-driven predictive controller and observing system performance. NEST provides an ideal setup to train generative models on actual measurements and evaluate the synthetic data’s utility by replicating actual experiments.

  • Research Article
  • Citations: 2
  • 10.3233/shti240490
On the Fidelity-Privacy Tradeoff of Synthetic Cancer Registry Data.
  • Aug 22, 2024
  • Studies in Health Technology and Informatics
  • Philipp Röchner

The sharing of personal health data is highly regulated due to privacy and security concerns. An alternative to sharing personal data is to share synthetic data, because ideally it should be impossible to reconstruct real personal data from synthetic data, which is called privacy. At the same time, the structure of the synthetic data should be as similar as possible to the structure of the real data to ensure that conclusions drawn from the synthetic data are also valid for the real data, which is called fidelity. Typically, there is a tradeoff between fidelity and privacy for synthetic health data. We study the fidelity and privacy of cancer data synthesized using generative machine learning approaches. To generate synthetic cancer data, we use variational autoencoders (VAEs), generative adversarial networks (GANs), and denoising diffusion probabilistic models (DDPMs). The tabular cancer registry data studied have nine categorical variables from breast cancer patients. We find that DDPMs generate synthetic cancer data with higher fidelity; that is, the structure of the synthetic data is more similar to the real cancer data than the data generated by VAEs and GANs. At the same time, synthetic cancer data from DDPMs pose a greater privacy risk because the data are more likely to reveal information from real patients than synthetic data from VAEs and GANs.

  • Research Article
  • Citations: 14
  • 10.1371/journal.pone.0260308
Generative adversarial networks for generating synthetic features for Wi-Fi signal quality.
  • Nov 23, 2021
  • PLOS ONE
  • Mauro Castelli + 4 more

Wireless networks are among the fundamental technologies used to connect people. Considering the constant advancements in the field, telecommunication operators must guarantee a high-quality service to keep their customer portfolio. To ensure this high-quality service, it is common to establish partnerships with specialized technology companies that deliver software services in order to monitor the networks and identify faults and respective solutions. A common barrier faced by these specialized companies is the lack of data to develop and test their products. This paper investigates the use of generative adversarial networks (GANs), which are state-of-the-art generative models, for generating synthetic telecommunication data related to Wi-Fi signal quality. We developed, trained, and compared two of the most used GAN architectures: the Vanilla GAN and the Wasserstein GAN (WGAN). Both models presented satisfactory results and were able to generate synthetic data similar to the real ones. In particular, the distribution of the synthetic data overlaps the distribution of the real data for all of the considered features. Moreover, the considered generative models can reproduce the same associations observed for the synthetic features. We chose the WGAN as the final model, but both models are suitable for addressing the problem at hand.

  • Research Article
  • Citations: 44
  • 10.1007/s11042-023-15747-6
Generative adversarial network based synthetic data training model for lightweight convolutional neural networks
  • May 20, 2023
  • Multimedia Tools and Applications
  • Ishfaq Hussain Rather + 1 more

Inadequate training data is a significant challenge for deep learning techniques, particularly in applications where data is difficult to get, and publicly available datasets are uncommon owing to ethical and privacy concerns. Various approaches, such as data augmentation and transfer learning, are employed to address this problem, which help to some extent in removing this limitation. However, after a certain amount of data augmentation, the quality of the generated data stalls, and transfer learning suffers from the issue of negative transfer. This paper proposes a novel generative adversarial network-based synthetic data training (GAN-ST) model to generate synthetic data for training a lightweight convolutional neural network (CNN). An enhanced generator is proposed to quickly saturate and cover the colour space of the training distribution. The GAN-ST model is based on Deep Convolutional Generative Adversarial Network(s) (DCGAN) and Conditional Generative Adversarial Network(s) (CGAN) models, which consist of an enhanced generator. The study evaluates the accuracy of a CNN model on the MNIST and CIFAR 10 datasets using both original and synthetic data. The results revealed an impressive classifier accuracy on the MNIST dataset, achieving an accuracy of 99.38% on GAN-ST-generated synthetic training data, which is only 0.05% lower than the performance on original data-based training. The classifier performance on the CIFAR dataset is also remarkable, achieving an accuracy of 90.23%. The performance of CNN trained using GAN-ST-based synthetic data is notable, with the most considerable improvement of 0.66% and 7.06%, over a single GAN-based synthetic data training for the MNIST and CIFAR datasets, respectively. By training two GANs independently, the GAN-ST model covers different parts of the original data distribution, resulting in a more diverse and realistic training data set for the classifier. 
This diverse set of synthetic data, when used to train a CNN, shows better generalization to new data, leading to improved classification accuracy.
