GLSTM: A novel approach for prediction of real & synthetic PID diabetes data using GANs and LSTM classification model
This study introduces a GAN-Long Short-Term Memory system for diabetes prediction, utilizing various GAN models to generate synthetic data that, when combined with real data, improves classification accuracy to 97% from 92%, with a mean correlation of 0.93 between synthetic and real data, outperforming existing methods.
Generative Adversarial Network (GAN) is a revolution in modern artificial systems. Deep learning-based generative adversarial networks generate realistic synthetic tabular data. Synthetic data are used to enlarge a relatively small training dataset while preserving the confidentiality of the original data. In this context, we implemented the GAN framework for generating diabetes data to help health care professionals in clinical applications. GAN is used to validate the Pima Indian Diabetes (PID) Dataset. Various preprocessing techniques, such as handling missing values, outliers and data imbalance problems, enhance data quality. Exploratory data analyses, such as heat maps, bar graphs and histograms, are used for data visualisation. We employed hypothesis testing to examine the resemblance between real data and GAN-generated synthetic data. In this study, we proposed a GAN-Long Short-Term Memory (GLSTM) system, in which GAN is used for data augmentation and LSTM is used for diabetes classification. Additionally, various GAN models such as CTGAN, Vanilla GAN, Copula GAN, Gaussian Copula GAN, and TVAE GAN are used to generate the synthetic dataset. Experiments were conducted on real data, on synthetic data, and on a combination of real and synthetic data. The model that used both real and synthetic data obtained a substantially better accuracy of 97%, compared to 92% when only real data was used. We also observed that synthetic data could be used in place of real data, as the mean correlation between synthetic and real data is 0.93. Our results outperformed state-of-the-art methodologies.
- Research Article
- 10.52458/23485477.2025.v12.iss2.kp.a1
- Jan 1, 2025
- Kaav International Journal of Science, Engineering & Technology:A Peer Review Quarterly Journal
Liver disease, a major global health issue causing approximately 2 million deaths annually, requires accurate predictive models for early detection. This study proposes GANN, a novel framework combining Generative Adversarial Networks (GANs) for synthetic data generation and Artificial Neural Networks (ANNs) for liver disease classification using the Indian Liver Patient Dataset (ILPD). The ILPD, with 583 samples and 10 features, faces challenges like missing values (1.7% in Albumin_and_Globulin_Ratio), class imbalance (416 liver disease vs. 167 non-liver disease cases), and outliers. We address these through preprocessing techniques such as MICE imputation, log transformation, and Proximity Weighted Synthetic Oversampling (PROW). Five GAN variants (CTGAN, Vanilla GAN, Copula GAN, Gaussian Copula GAN, and TVAE) generate 2,000 synthetic samples, validated by Kolmogorov-Smirnov (KS) tests (mean correlation 0.92 with real data). Visualizations, including histograms and correlation matrices, reveal data distributions and relationships. The GANN model achieves 95% accuracy with combined real and synthetic data, compared to 90% with real data alone, outperforming state-of-the-art methods (82–91.2% accuracy). These results suggest GANN's potential as a robust tool for liver disease prediction, pending further validation.
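The abstract above validates synthetic samples with Kolmogorov-Smirnov tests. As a rough illustration only (this is not the authors' code, and the function name `ks_statistic` is an assumption introduced here), the two-sample KS statistic for a single feature can be sketched in plain Python as the largest gap between the empirical CDFs of the real and synthetic values:

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest absolute gap between the
    empirical CDFs of the two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The supremum of the gap between two step functions is reached
    # at one of the observed data points, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of real values <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of synthetic values <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```

A value near 0 suggests the synthetic feature follows the real distribution closely. In practice a library routine that also returns a p-value would normally be used instead of a hand-written loop.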
- Research Article
20
- 10.1016/j.cmpb.2022.107019
- Jul 10, 2022
- Computer Methods and Programs in Biomedicine
Enhancing classification of cells procured from bone marrow aspirate smears using generative adversarial networks and sequential convolutional neural network
- Research Article
8
- 10.1002/ima.22719
- Mar 8, 2022
- International Journal of Imaging Systems and Technology
The mortality rate due to lung cancer is increasing rapidly worldwide because the disease can often be classified only at later stages. Early classification of lung cancer will help patients to receive treatment and decrease the death rate. The limited dataset and the diversity of data samples are the bottlenecks for early classification. In this paper, robust deep learning generative adversarial network (GAN) models are employed to enhance the dataset and to increase classification accuracy. The activation function plays an important feature‐learning role in neural networks. Since the existing activation functions suffer from various drawbacks such as vanishing gradients, dead neurons, and output offset, this paper proposes a novel activation function, the exponential mean saturation linear unit (EMSLU), which aims to speed up training, reduce network running time, and improve classification accuracy. The experiments were conducted using vanilla GAN, Wasserstein generative adversarial network, Wasserstein generative adversarial network with gradient penalty, conditional generative adversarial network, and deep convolutional generative adversarial network. Each GAN is tested with the rectified linear unit, the exponential linear unit, and the proposed EMSLU activation functions. The results show that all the GANs with EMSLU yield improved precision, recall, F1‐score, and accuracy.
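To make the "dead neurons" drawback mentioned above concrete, here is a minimal sketch of the two baseline activations named in the abstract, using their standard textbook definitions. The proposed EMSLU is deliberately not shown because the abstract does not give its formula; the `alpha` default of 1.0 is an assumption.

```python
import math


def relu(x):
    """Rectified linear unit: passes positives, zeroes out negatives."""
    return max(0.0, x)


def relu_grad(x):
    """ReLU gradient is exactly 0 for x <= 0, so a unit stuck in the
    negative range receives no update (the "dead neuron" problem)."""
    return 1.0 if x > 0 else 0.0


def elu(x, alpha=1.0):
    """Exponential linear unit: smooth negative branch saturating at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def elu_grad(x, alpha=1.0):
    """ELU gradient stays positive for negative inputs, but it shrinks
    toward 0 as x becomes very negative."""
    return 1.0 if x > 0 else alpha * math.exp(x)
```

For an input of -2.0, `relu_grad` returns 0.0 while `elu_grad` returns a small positive value, which is the behavioural difference the abstract alludes to.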
- Research Article
21
- 10.1109/jbhi.2023.3236722
- Aug 1, 2023
- IEEE Journal of Biomedical and Health Informatics
The aim of this study is to apply and characterize eXplainable AI (XAI) to assess the quality of synthetic health data generated using a data augmentation algorithm. In this exploratory study, several synthetic datasets are generated using various configurations of a conditional Generative Adversarial Network (GAN) from a set of 156 observations related to adult hearing screening. A rule-based native XAI algorithm, the Logic Learning Machine, is used in combination with conventional utility metrics. The classification performance in different conditions is assessed: models trained and tested on synthetic data, models trained on synthetic data and tested on real data, and models trained on real data and tested on synthetic data. The rules extracted from real and synthetic data are then compared using a rule similarity metric. The results indicate that XAI may be used to assess the quality of synthetic data by (i) the analysis of classification performance and (ii) the analysis of the rules extracted on real and synthetic data (number, covering, structure, cut-off values, and similarity). These results suggest that XAI can be used in an original way to assess synthetic health data and extract knowledge about the mechanisms underlying the generated data.
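The three evaluation conditions listed above (train and test on synthetic, train on synthetic and test on real, train on real and test on synthetic) can be sketched as follows. This is a toy illustration under stated assumptions: a nearest-centroid classifier stands in for the study's actual models, and the function names and the `TSTS`/`TSTR`/`TRTS` labels are introduced here for convenience.

```python
def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class label."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def accuracy(centroids, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(centroids, row) == lab for row, lab in zip(X, y))
    return hits / len(y)


def evaluate_protocols(real, synth):
    """Score the three train/test combinations described in the abstract.
    Note: TSTS here reuses the training rows for testing to keep the
    sketch short; a real study would hold out a separate test split."""
    (X_real, y_real), (X_syn, y_syn) = real, synth
    model_real = fit_centroids(X_real, y_real)
    model_syn = fit_centroids(X_syn, y_syn)
    return {
        "TSTS": accuracy(model_syn, X_syn, y_syn),    # train synthetic, test synthetic
        "TSTR": accuracy(model_syn, X_real, y_real),  # train synthetic, test real
        "TRTS": accuracy(model_real, X_syn, y_syn),   # train real, test synthetic
    }
```

A large drop from TSTS to TSTR would hint that the synthetic data miss structure present in the real data, which is the kind of gap these conditions are designed to expose.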
- Research Article
4
- 10.3171/2025.4.focus25225
- Jul 1, 2025
- Neurosurgical focus
Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
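The F1 scores reported above (0.725 vs 0.688) are the harmonic mean of precision and recall for the positive class. A minimal sketch of that computation, with the function name and the `positive=1` label convention being assumptions made here:

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for a binary task: harmonic mean of precision and recall
    computed for the class labelled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision and recall are 0 or undefined,
        # so F1 is reported as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because F1 ignores true negatives, it is a common choice when the outcome of interest (here, KPS deterioration) is the minority class.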
- Conference Article
27
- 10.1109/sgsma.2019.8784681
- May 1, 2019
This paper concerns the production of synthetic phasor measurement unit (PMU) data for research and education purposes. Because real PMU data are confidential and there is no public access to real power system infrastructure information, the lack of credible, realistic data is a growing concern. Instead of constructing synthetic power grids and then producing synthetic PMU measurement data by time simulations, we propose a model-free approach to directly generate synthetic PMU data. We train the generative adversarial network (GAN) with real PMU data, which can then be used to generate synthetic PMU data capturing the system's dynamic behaviors. To validate that sequential generation by the GAN can mimic PMU data, we theoretically analyze the GAN's capacity for learning system dynamics. Further, by evaluating the synthetic PMU data with a proposed quantitative method, we verify the GAN's potential to synthesize realistic samples, while recognizing that the GAN model in this paper still has room to improve. Moreover, this is the first time that such a generative model has been applied to synthesize PMU data.
- Research Article
10
- 10.3390/app11062787
- Mar 20, 2021
- Applied Sciences
Fermentation is an age-old technique used to preserve food by restoring proper microbial balance. Boiled barley and nuruk are fermented for a short period to produce Shindari, a traditional beverage for the people of Jeju, South Korea. Shindari has been proven to be a drink of multiple health benefits if fermented for an optimal period. It is necessary to predict the ideal fermentation time required by each microbial community to keep the advantages of the microorganisms produced by the fermentation process in Shindari intact and to eliminate contamination. Prediction through machine learning requires past data but the process of obtaining fermentation data of Shindari is time consuming, expensive, and not easily available. Therefore, there is a need to generate synthetic fermentation data to explore various benefits of the drink and to reduce any risk from overfermentation. In this paper, we propose a model that takes incomplete tabular fermentation data of Shindari as input and uses multiple imputation ensemble (MIE) and generative adversarial networks (GAN) to generate synthetic fermentation data that can be later used for prediction and microbial spoilage control. For multiple imputation, we used multivariate imputation by chained equations and random forest imputation, and ensembling was done using the bagging and stacking method. For generating synthetic data, we remodeled the tabular GAN with skip connections and adapted the architecture of Wasserstein GAN with gradient penalty. We compared the performance of our model with other imputation and ensemble models using various evaluation metrics and visual representations. Our GAN model could overcome the mode collapse problem and converged at a faster rate than existing GAN models for synthetic data generation. Experiment results show that our proposed model executes with less error, is more accurate, and generates significantly better synthetic fermentation data compared to other models.
- Abstract
1
- 10.1182/blood-2022-171057
- Nov 15, 2022
- Blood
Systematic Evaluation of Synthetic Panel Data Quality with an Application to Chronic Lymphocytic Leukemia
- Abstract
2
- 10.1182/blood-2022-168646
- Nov 15, 2022
- Blood
Synthetic Data Generation By Artificial Intelligence to Accelerate Translational Research and Precision Medicine in Hematological Malignancies
- Research Article
39
- 10.1016/j.eswa.2022.117936
- Jun 27, 2022
- Expert Systems with Applications
Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems
- Research Article
6
- 10.3390/stats7030049
- Aug 3, 2024
- Stats
The lack of data on flood events poses challenges in flood management. In this paper, we propose a novel approach to enhance flood-forecasting models by utilizing the capabilities of Generative Adversarial Networks (GANs) to generate synthetic flood events. We modified a time-series GAN by incorporating constraints related to mass conservation, energy balance, and hydraulic principles into the GAN model through appropriate regularization terms in the loss function and by using a mass-conservative LSTM in the generator and discriminator models. In this way, we can improve the realism and physical consistency of the generated extreme flood-event data. These constraints ensure that the synthetic flood-event data generated by the GAN adhere to fundamental hydrological principles and characteristics, enhancing the accuracy and reliability of flood-forecasting and risk-assessment applications. PCA and t-SNE are applied to provide valuable insights into the structure and distribution of the synthetic flood data, highlighting patterns, clusters, and relationships within the data. We aimed to use the generated synthetic data to supplement the original data and train a probabilistic neural runoff model for forecasting multi-step ahead flood events. A t-test was performed to compare the means of synthetic data generated by TimeGAN with the original data, and the results showed that the means of the two datasets were statistically significant at the 95% level. The integration of time-series GAN-generated synthetic flood events with real data improved the robustness and accuracy of the autoencoder model, enabling more reliable predictions of extreme flood events.
In the pilot study, the model trained on the augmented dataset with synthetic data from time-series GAN shows higher NSE and KGE scores of NSE = 0.838 and KGE = 0.908, compared to the NSE = 0.829 and KGE = 0.90 of the sixth hour ahead, indicating improved accuracy of 9.8% NSE in multistep-ahead predictions of extreme flood events compared to the model trained on the original data alone. The integration of synthetic training datasets in the probabilistic forecasting improves the model’s ability to achieve a reduced Prediction Interval Normalized Average Width (PINAW) for interval forecasting, yet this enhancement comes with a trade-off in the Prediction Interval Coverage Probability (PICP).
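For readers unfamiliar with the NSE and KGE scores quoted above, the following sketch computes both from paired observed and simulated series. It assumes the standard Nash-Sutcliffe efficiency and the original 2009 form of the Kling-Gupta efficiency; the abstract does not say which KGE variant the authors used, so treat this as illustrative only.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 means the model
    is no better than always predicting the observed mean."""
    mean_obs = _mean(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    variance_term = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / variance_term


def kge(obs, sim):
    """Kling-Gupta efficiency (2009 form, assumed here): combines the
    correlation r, variability ratio alpha, and bias ratio beta.
    Undefined when either series is constant or the observed mean is 0."""
    mean_obs, mean_sim = _mean(obs), _mean(sim)
    n = len(obs)
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sim) / n)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)) / n
    r = cov / (std_obs * std_sim)   # linear correlation
    alpha = std_sim / std_obs       # variability ratio
    beta = mean_sim / mean_obs      # bias ratio
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
```

Both scores have an upper bound of 1, so the reported values (NSE = 0.838, KGE = 0.908) indicate a close but imperfect match between forecast and observed flows.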
- Research Article
60
- 10.3390/app12147075
- Jul 13, 2022
- Applied Sciences
Modern machine and deep learning methods require large datasets to achieve reliable and robust results. This requirement is often difficult to meet in the medical field, due to data sharing limitations imposed by privacy regulations or the presence of a small number of patients (e.g., rare diseases). To address this data scarcity and to improve the situation, novel generative models such as Generative Adversarial Networks (GANs) have been widely used to generate synthetic data that mimic real data by representing features that reflect health-related information without reference to real patients. In this paper, we consider several GAN models to generate synthetic data used for training binary (malignant/benign) classifiers, and compare their performances in terms of classification accuracy with cases where only real data are considered. We aim to investigate how synthetic data can improve classification accuracy, especially when a small amount of data is available. To this end, we have developed and implemented an evaluation framework where binary classifiers are trained on extended datasets containing both real and synthetic data. The results show improved accuracy for classifiers trained with generated data from more advanced GAN models, even when limited amounts of original data are available.
- Conference Article
111
- 10.1109/cvprw.2019.00305
- Jun 1, 2019
Calibrating sports cameras is important for autonomous broadcasting and sports analysis. Here we propose a highly automatic method for calibrating sports cameras from a single image using synthetic data. First, we develop a novel camera pose engine. The camera pose engine has only three significant free parameters, so it can efficiently generate many camera poses and the corresponding edge (i.e., field marking) images. Then, we learn compact deep features via a siamese network from paired edge images and camera poses and build a feature-pose database. After that, we use a novel two-GAN (generative adversarial network) model to detect field markings in real images. Finally, we query an initial camera pose from the feature-pose database and refine camera poses using truncated distance images. We evaluate our method on both synthetic and real data. Our method not only demonstrates robustness on the synthetic data but also achieves state-of-the-art accuracy on a standard soccer dataset and very high performance on a volleyball dataset.
- Research Article
11
- 10.1088/1742-6596/1577/1/012027
- Jul 1, 2020
- Journal of Physics: Conference Series
Continuous numerical data are often used for unsupervised learning tasks such as clustering. However, such valuable data are often available only in small amounts because they are hard to obtain, expensive, require an expert to collect, or contain confidential information that cannot be published. These data limitations can be an obstacle to processing and analyzing data, or can restrain clustering-related research in general. Therefore, an alternative is needed that can replace or increase the amount of data. The proposed method generates synthetic continuous numerical data using Generative Adversarial Networks (GANs). This study used two GAN architectures (GAN and CGAN) and focused on unlabeled continuous numerical data to provide replacement or additional data for the clustering task. The quality of the synthetic data was measured using the accuracy of the XGBoost algorithm in classifying real and synthetic data. Whereas the XGBoost accuracy for perfectly realistic data would be 50%, synthetic data based on CGAN achieved 63%. The results of this study show that GANs can generate data that are similar enough to, and not significantly different from, the real data.
- Research Article
21
- 10.3389/fmicb.2022.1059123
- Dec 22, 2022
- Frontiers in Microbiology
Protective coatings based on two-dimensional materials such as graphene have gained traction for diverse applications. Their impermeability, inertness, excellent bonding with metals, and amenability to functionalization render them promising coatings against both abiotic and microbiologically influenced corrosion (MIC). Owing to the success of graphene coatings, the whole family of 2D materials, including hexagonal boron nitride and molybdenum disulphide, is being screened to obtain other promising coatings. AI-based data-driven models can accelerate virtual screening of 2D coatings with desirable physical and chemical properties. However, the lack of large experimental datasets renders training of classifiers difficult and often results in over-fitting. Generating large datasets for MIC resistance of 2D coatings is both complex and laborious. Deep learning data augmentation methods can alleviate this issue by generating synthetic electrochemical data that resemble the training data classes. Here, we investigated two different deep generative models, namely the variational autoencoder (VAE) and the generative adversarial network (GAN), for generating synthetic data to expand small experimental datasets. Our model experimental system included few-layered graphene over copper surfaces. The synthetic data generated using the GAN yielded higher neural network performance (83-85% accuracy) than VAE-generated synthetic data (78-80% accuracy). However, VAE data performed better (90% accuracy) than GAN data (84%-85% accuracy) when using XGBoost. Finally, we show that synthetic data based on VAE and GAN models can drive machine learning models for developing MIC-resistant 2D coatings.