GLSTM: A novel approach for prediction of real &amp; synthetic PID diabetes data using GANs and LSTM classification model

Sushma Jaiswal,Priyanka Gupta

doi:10.52756/ijerr.2023.v30.004

Abstract

Generative Adversarial Network (GAN) is a revolution in modern artificial systems. Deep learning-based Generative adversarial networks generate realistic synthetic tabular data. Synthetic data are used to enhance the size of a relatively small training dataset while ensuring the confidentiality of the original data. In this context, we implemented the GAN framework for generating diabetes data to help the health care professional in more clinical applications. GAN is used to validate the Pima Indian Diabetes (PID) Dataset. Various preprocessing techniques, such as handling missing values, outliers and data imbalance problems, enhance data quality. Some exploratory data analyses, such as heat maps, bar graphs and histograms, are used for data visualisation. We employed hypothesis testing to examine the resemblance between real data and GAN-generated synthetic data. In this study, we proposed a GAN-Long Short-Term Memory (GLSTM) system, in which GAN is used for data augmentation, and LSTM is used for diabetes classification. Additionally, various GAN models such as CTGAN, Vanilla GAN, Coupula GAN, Gaussian Coupula GAN, and TVAE GAN are used to generate the synthetic dataset. Experiments were conducted on real data, synthetic data, and by combining real and synthetic data. The model that used both real and synthetic data obtained a substantially better accuracy of 97% compared to 92% when only real data was used. We also observed that synthetic data could be used in place of real data, as the mean correlation between synthetic and real data is 0.93. Our study's findings outperformed when compared to state-of-the-art methodologies.

Full Text