Enhancing buckwheat maturity classification with generative adversarial networks for spectroscopy data augmentation
Introduction: The optimal harvest period for buckwheat is challenging to determine due to its short growth cycle. Harvesting too early or too late can negatively affect the quality of the crop. Traditional harvest methods are labor-intensive and fail to account for the spatial variability in buckwheat quality within a field. This study explores the use of near-infrared (NIR) spectral data to classify the maturity stages of buckwheat. Method: Four distinct developmental stages were examined: UM (Unripe Maturity), representing buckwheat harvested at 65 days after sowing; HM (Half Maturity), harvested at 75 days; MS (Full Maturity with Shell), harvested at 85 days with husks intact; and MUS (Full Maturity Unhulled Sample), also harvested at 85 days but manually dehulled. Because traditional machine learning models require diverse and extensive datasets, this study investigates the use of a conditional Wasserstein generative adversarial network with gradient penalty (WGAN-GP) to generate synthetic datasets and improve model performance. Four machine learning models were employed in this study: Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbors (KNN), and Partial Least Squares Linear Discriminant Analysis (PLS-LDA). Results and Discussion: The conditional WGAN-GP was trained for a range of epochs: 1000, 2000, 8000, 10,000, and 20,000. After training for 10,000 epochs, synthetic hyperspectral reflectance data were very similar to real spectra for each maturity category. To assess the impact of conditional WGAN-GP data augmentation, model performance was first evaluated using the original dataset as a baseline, showing that PLS-LDA had the best classification performance, with an accuracy of 95% and a kappa coefficient of 0.93. The models were then trained on a combination of original and synthetic data, revealing that synthetic data can improve classification performance for RF and KNN. The best classification performance was achieved by RF, with an accuracy of 97% and a kappa coefficient of 0.94. 
This study demonstrates the effectiveness of synthetic data in enhancing classification accuracy.
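The abstract reports both accuracy and the kappa coefficient. As a minimal sketch of how the two relate, the function below computes them from a confusion matrix. The 4-class matrix is hypothetical and is not taken from the study; the function name `accuracy_and_kappa` is likewise an illustrative assumption.

```python
def accuracy_and_kappa(confusion):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals, summed over classes.
    row_totals = [sum(confusion[i]) for i in range(k)]
    col_totals = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


# Hypothetical confusion matrix for the four stages (UM, HM, MS, MUS).
example = [
    [24, 1, 0, 0],
    [1, 23, 1, 0],
    [0, 1, 24, 0],
    [0, 0, 0, 25],
]
acc, kappa = accuracy_and_kappa(example)
print(f"accuracy={acc:.2f}, kappa={kappa:.3f}")
```

Kappa corrects accuracy for agreement expected by chance, which is why it is lower than accuracy for the same predictions.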
- Research Article
160
- 10.2196/18910
- Jul 20, 2020
- JMIR Medical Informatics
Background: The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective: This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods: A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results: A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. 
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions: The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
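The evaluation protocol described above (train on real or on synthetic data, always test on real data) can be sketched as follows. This is a toy illustration only: the 1-nearest-neighbor classifier, the function names, and all data points are assumptions made for the example and do not come from the study.

```python
def predict_1nn(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[best]


def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test points whose 1-NN prediction matches the true label."""
    hits = sum(predict_1nn(train_X, train_y, x) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


# Hypothetical real training data, synthetic training data, and real test data.
real_train_X = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
real_train_y = [0, 0, 1, 1]
synth_train_X = [(0.1, 0.3), (0.7, 0.6), (1.1, 0.9), (0.8, 0.8)]
synth_train_y = [0, 0, 1, 1]
real_test_X = [(0.1, 0.1), (0.3, 0.2), (0.7, 0.7), (1.0, 0.9)]
real_test_y = [0, 0, 1, 1]

# Train on real, test on real (baseline).
trtr = accuracy(real_train_X, real_train_y, real_test_X, real_test_y)
# Train on synthetic, test on real.
tstr = accuracy(synth_train_X, synth_train_y, real_test_X, real_test_y)
deviation = trtr - tstr
print(f"real-trained={trtr:.2f}, synthetic-trained={tstr:.2f}, deviation={deviation:.2f}")
```

The deviation printed at the end corresponds to the kind of accuracy gap the abstract reports per model family; here it arises because one synthetic class-0 point drifts into the class-1 region.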
- Research Article
8
- 10.3390/app142310818
- Nov 22, 2024
- Applied Sciences
This study presents a novel approach using Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) to generate synthetic electroencephalography (EEG) and electrocardiogram (ECG) waveforms. The synthetic EEG data represent concentration and relaxation mental states, while the synthetic ECG data correspond to normal and abnormal states. By addressing the challenges of limited biophysical data, including privacy concerns and restricted volunteer availability, our model generates realistic synthetic waveforms learned from real data. Combining real and synthetic datasets improved classification accuracy from 92% to 98.45%, highlighting the benefits of dataset augmentation for machine learning performance. The WGAN-GP model achieved 96.84% classification accuracy for synthetic EEG data representing relaxation states and optimal accuracy for concentration states when classified using a fusion of convolutional neural networks (CNNs). A 50% combination of synthetic and real EEG data yielded the highest accuracy of 98.48%. For EEG signals, the real dataset consisted of 60-s recordings across four channels (TP9, AF7, AF8, and TP10) from four individuals, providing approximately 15,000 data points per subject per state. For ECG signals, the dataset contained 1200 real samples, each comprising 140 data points, representing normal and abnormal states. WGAN-GP outperformed a basic generative adversarial network (GAN) in generating reliable synthetic data. For ECG data, a support vector machine (SVM) classifier achieved an accuracy of 98% with real data and 95.8% with synthetic data. Synthetic ECG data improved the random forest (RF) classifier’s accuracy from 97% with real data alone to 98.40% when combined with synthetic data. Statistical significance was assessed using the Wilcoxon signed-rank test, demonstrating the robustness of the WGAN-GP model. 
Techniques such as discrete wavelet transform, downsampling, and upsampling were employed to enhance data quality. This method shows significant potential in addressing biophysical data scarcity and advancing applications in assistive technologies, human-robot interaction, and mental health monitoring, among other medical applications.
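For reference, the critic objective that distinguishes WGAN-GP from a basic GAN is usually written as below. This is the standard formulation from the WGAN-GP literature, not a detail given in the abstract; the symbols $D$ (critic), $P_r$ (real distribution), $P_g$ (generator distribution), and $\lambda$ (penalty weight) are notation introduced here, and the abstract does not state which $\lambda$ was used.

```latex
L_{\text{critic}}
  = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  + \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\,\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The last term penalizes the critic when its gradient norm at points interpolated between real and generated samples departs from 1, which replaces the weight clipping of the original WGAN.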
- Conference Article
- 10.54941/ahfe1006801
- Jan 1, 2025
- AHFE international
The application of synthetic data within the biomedical domain is rapidly gaining momentum, driven by the growing need for robust datasets suitable for machine learning (ML) and statistical modeling. In scenarios where access to real patient data is limited due to privacy concerns or scarcity, synthetic data offers an attractive alternative. These artificially generated datasets aim to mimic the statistical characteristics of original data, enabling researchers to conduct exploratory analysis, develop predictive models, or validate findings without compromising patient confidentiality. However, the increasing use of synthetic data raises several methodological and interpretative challenges, particularly regarding the correct sequence and context for applying statistical analyses. One of the central issues identified in contemporary literature concerns the timing of data analysis relative to the synthetic data generation process. Some studies conduct statistical or ML analyses directly on real datasets and use synthetic data for validation or augmentation. Others, conversely, perform all stages of analysis, including feature importance estimation, correlation assessment, and model training, on synthetic data. This inconsistency raises the question of whether statistical analysis conducted solely on synthetic datasets yields reliable insights, or whether it constitutes a methodological flaw. The prevailing assumption is that analysis should ideally be performed on real data to preserve statistical integrity, but empirical evaluation of this notion remains limited. In the current study, the authors address this issue by applying a synthetic data generation method, specifically the Tabular Variational Autoencoder (TVAE), to a biomedical dataset focused on bladder cancer recurrence. This dataset includes various diagnostic variables, and the primary goal is to assess how well synthetic data replicates analytical insights drawn from the original data. 
To achieve this, the authors conduct both correlational analysis and machine learning-based feature importance estimation. The results derived from synthetic datasets of varying sizes are then compared to those obtained from the original data. The findings indicate that while synthetic data can approximate general trends observed in the original dataset, there are notable differences depending on the analytical technique employed. In particular, models such as Random Forest appear more sensitive to variations introduced during the synthesis process. This sensitivity manifests as shifts in feature importance rankings and variability in predictive performance, especially when working with smaller synthetic datasets. On the other hand, simpler statistical methods such as correlation coefficients display more stability, suggesting that some analytical approaches may be more robust to data generation artifacts than others. These observations underscore the importance of methodological caution when interpreting results based on synthetic biomedical data. While synthetic datasets hold considerable promise for advancing data-driven research in biomedicine, they are not a one-size-fits-all solution. The sequence in which synthetic data is introduced into the research pipeline, whether before or after statistical analysis, can significantly influence the validity of the findings. As such, researchers must critically assess the suitability of synthetic data for specific analytical tasks and ensure transparency in reporting their methodological choices. Future work should further explore the impact of different generative models and dataset properties on the reliability of synthetic-data-driven insights.
- Research Article
8
- 10.59796/jcst.v15n2.2025.99
- Mar 25, 2025
- Journal of Current Science and Technology
Road traffic accidents (RTAs) pose a significant global challenge, particularly in Thailand. This study investigates the impact of resampling techniques on machine learning (ML) models for classifying road accident severity in Thailand, utilizing data from 31,817 road traffic accidents collected between January 1, 2021, and December 31, 2022. The primary challenge addressed is class imbalance, where fatal accidents represent a small fraction of the dataset. Three popular ML models, including Random Forest (RF), K-Nearest Neighbors (KNN), and Extreme Gradient Boosting (XGB), were evaluated with four resampling techniques: Imbalanced (IB), Under-sampling (US), Over-sampling (OS), and Combined Sampling (CS). These resampling approaches generated 12 ML models, whose performance was evaluated under three different train/test split ratios: 70/30, 80/20, and 90/10. Compared to the IB approach, the results demonstrate that all US, OS and CS techniques significantly improved model performance, particularly in terms of F1 score, G-mean, and balanced accuracy. Among the models, RF-CS, KNN-OS, and XGB-CS exhibited the best classification performance. Although these evaluation metrics improved over the imbalanced scheme, KNN’s overall performance in detecting fatal accidents was weaker compared to RF and XGB. Specifically, KNN struggled more with the imbalanced dataset, even after applying resampling techniques. These findings suggest that choosing the appropriate resampling techniques is crucial for enhancing model performance in classifying accident severity.
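To illustrate why F1 score, G-mean, and balanced accuracy are reported for imbalanced data instead of plain accuracy, here is a minimal sketch that computes all four from binary confusion counts. The counts are hypothetical (a rare "fatal" positive class) and are not taken from the study.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Metrics for a binary problem where the positive class is the minority."""
    recall = tp / (tp + fn)            # sensitivity on the minority class
    specificity = tn / (tn + fp)       # recall on the majority class
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
        "g_mean": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# Hypothetical counts: 100 fatal cases out of 10,000 accidents.
metrics = imbalance_metrics(tp=40, fn=60, fp=100, tn=9800)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.984 even though only 40% of the minority class is detected; the other three metrics expose that weakness.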
- Research Article
4
- 10.3171/2025.4.focus25225
- Jul 1, 2025
- Neurosurgical focus
Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
- Research Article
17
- 10.1016/j.autcon.2025.106208
- Jul 1, 2025
- Automation in Construction
Recent advancements in AI-based Digital Twins (DTs) have substantially influenced bridge monitoring and maintenance, especially through Deep Learning (DL) for sensor-based damage detection. However, the effectiveness of DL models is constrained by the extensive training data they require, which is often costly and time-consuming to collect in bridge infrastructure contexts. To address this data scarcity, this paper proposes a data augmentation strategy employing a transformer-based time-series Wasserstein generative adversarial network with gradient penalty (TTS-WGAN-GP) to generate synthetic acceleration data. The synthetic data's fidelity is validated through similarity metrics and frequency domain analysis, showing close alignment with real acceleration signals for damage detection. Results demonstrate that this method achieves high-quality synthetic data with superior computational efficiency compared to existing approaches, improving dataset balancing and potentially enhancing the performance of data-driven models in DTs. This approach reduces dependence on extensive data collection, supporting reliable bridge health monitoring applications. • Development of a GAN model for generating synthetic bridge acceleration data. • Introduction of a new similarity metric for assessing synthetic data quality in bridge damage detection. • Comparative analysis of the proposed model against existing GAN methods in structural health monitoring (SHM) • Application of the proposed model to generate synthetic acceleration signals based on real-world data from the Werrington Bridge.
- Research Article
18
- 10.3389/fbioe.2024.1350135
- Feb 14, 2024
- Frontiers in Bioengineering and Biotechnology
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data. Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation. Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples. Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
- Research Article
18
- 10.2196/47859
- Nov 24, 2023
- JMIR Medical Informatics
Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information. This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships. The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)-based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. The synthetic data of the 3 diseases (non-small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). 
The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
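One of the two partitioning criteria mentioned above relies on Cramer's V to find strongly associated columns. The sketch below shows the usual (uncorrected) definition, V = sqrt(chi-squared / (n * (min(rows, cols) - 1))), computed from a contingency table. The table values and the function name are hypothetical; how the study applied the statistic in detail is not stated in the abstract.

```python
import math


def cramers_v(table):
    """Cramer's V (without bias correction) for a contingency table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-squared statistic against the independence expectation.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row_totals), len(col_totals)) - 1)))


# Hypothetical 2x2 tables for two categorical columns.
moderate = cramers_v([[30, 10], [10, 30]])
perfect = cramers_v([[20, 0], [0, 20]])
print(f"moderate association: {moderate:.2f}, perfect association: {perfect:.2f}")
```

Values range from 0 (no association) to 1 (one column fully determines the other), so the column pair with the highest value is a natural candidate for splitting the data.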
- Preprint Article
- 10.31224/4629
- May 15, 2025
Artificial Intelligence is growing rapidly in a highly interconnected world, providing solutions to problems that were unimaginable just a few years ago, while at the same time opening the door to existential risks and dangers for humanity. Keeping its development under control and respecting the individual is one of the main goals of Human-Centred Artificial Intelligence, a branch of computer science that has emerged in the last decade and aims to make the research, production and use of Artificial Intelligence algorithms transparent, credible, safe and ethical. With the advent of cyber-attacks against such algorithms, regulation and protection have become imperative. Through the use of certain Artificial Intelligence models, it is indeed possible to extract the information learned by third party algorithms, showing how the training data is present in these architectures, albeit in the form of a latent representation. Data, whatever its nature or form, is thus one of the most important and debated resources, being on the one hand the essential ingredient for learning algorithms, and on the other hand an asset to be protected and kept private. This thesis begins by examining the current landscape of privacy preservation techniques in deep learning, revealing significant challenges in balancing model performance with data protection. Existing methods, including Differential Privacy, often result in substantial compromises with respect to privacy guarantees and model performance, limiting their practical application in real-world scenarios. In response to these challenges, this research introduces a series of novel contributions aimed at enhancing both privacy and performance in deep learning systems. Initially, it explores regularisation techniques as a means to improve privacy protection whilst maintaining model performance. 
This approach proves to be a promising alternative to more computationally intensive methods, offering a better balance between privacy and utility. Building upon this foundation, the work presents Discriminative Adversarial Privacy (DAP), a new strategy that leverages adversarial training to simultaneously optimise for task performance and privacy protection. This approach demonstrates significant improvements over traditional methods, offering a more favourable balance between model accuracy and privacy guarantees. The thesis then investigates the potential of federated learning as a privacy-preserving technique for collaborative model development. Recognising the vulnerabilities inherent in traditional approaches, it proposes Synthetic Generative Data Exchange (SGDE). This innovative method leverages generative models to produce synthetic data for exchange within a federated learning context, significantly enhancing privacy protections whilst maintaining or even improving model performance. Expanding on the concept of synthetic data, a comprehensive pipeline called Gap Filler (GaFi) is developed to optimise the quality and utility of synthetic datasets for downstream tasks. This approach significantly narrows the performance gap between models trained on synthetic versus real-world data across various domains. Additionally, the research explores the adaptation of Stable Diffusion 2.0 for synthetic dataset generation, incorporating techniques such as transfer learning and fine-tuning. Building upon these advancements, the Knowledge Recycling (KR) pipeline is introduced, which integrates and refines the insights from GaFi and the Stable Diffusion experiments. KR employs advanced generative techniques to further enhance the effectiveness of synthetic data in model training, demonstrating its potential to surpass real data in certain scenarios. In the context of collaborative learning, this research proposes Federated Knowledge Recycling (FedKR). 
This novel approach enables secure and effective collaboration across institutions without compromising data privacy. By leveraging locally generated synthetic data and sophisticated aggregation mechanisms, it offers enhanced security and improved model performance compared to traditional federated learning techniques. In conclusion, this thesis presents a series of methodologies and techniques that contribute to the ongoing development of privacy-preserving deep learning. The proposed approaches offer potential solutions to some of the current challenges in balancing data utility and privacy in machine learning applications.
- Research Article
9
- 10.1115/1.4062741
- Jul 14, 2023
- ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering
Despite the pipeline network being the safest mode of oil and gas transportation systems, the pipeline failure rate has increased significantly over the last decade, particularly for aging pipelines. Predicting failure risk and prioritizing the riskiest asset from a large set of pipelines is one of the demanding tasks for the utilities. Machine learning (ML) application in pipeline failure risk prediction has recently shown promising results. However, due to safety and security concerns, obtaining sufficient operation and failure data to train ML models accurately is a significant challenge. This study employed a Generative Adversarial Network (GAN) based framework to generate synthetic pipeline data (DSyn) using a subset (70%) of experimental burst test results data (DExp) compiled from the literature to overcome the limitation of accessing operational data. The proposed framework was tested on (1) real data, and (2) combined real and generated synthetic data. The burst failure risk of corroded oil and gas pipelines was determined using probabilistic approaches, and pipelines were classified into two classes depending on their probability of failure: (1) low failure risk (Pf: 0–0.5) and (2) high failure risk (Pf: >0.5). Two random forest (RF) models (MExp and MComb) were trained using a subset of 70% of actual experimental pipeline data (DExp) and a combination of 70% of actual experimental and 100% of synthetic data, respectively. These models were validated on the remaining subset (30%) of experimental test data. The validation results reveal that adding synthetic data can further improve the performance of the ML models. The area under the ROC curve was found to be 0.96 and 0.99 for the real-data model (MExp) and the combined-data model (MComb), respectively. 
The combined model with improved performance can be used in strategic oil and gas pipeline resilience improvement planning, which sets long-term critical decisions regarding maintenance and potential replacement of pipes.
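The two-class labelling rule (Pf above 0.5 is high risk) and the area under the ROC curve used to compare MExp and MComb can be sketched as follows. The probabilities of failure and the model scores are made up for illustration; the AUC is computed with the rank-based (Mann-Whitney) formulation, which is an assumption about method, not something the abstract specifies.

```python
def risk_class(pf):
    """1 = high failure risk (Pf > 0.5), 0 = low failure risk (Pf in 0..0.5)."""
    return 1 if pf > 0.5 else 0


def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probabilities of failure for six validation pipes.
pf_values = [0.1, 0.3, 0.6, 0.9, 0.45, 0.7]
labels = [risk_class(pf) for pf in pf_values]
# Hypothetical classifier scores for the same pipes.
scores = [0.2, 0.4, 0.35, 0.8, 0.3, 0.9]
auc = roc_auc(labels, scores)
print(f"labels={labels}, AUC={auc:.3f}")
```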
- Research Article
20
- 10.24203/ajcis.v10i1.6882
- Feb 27, 2022
- Asian Journal of Computer and Information Systems
Students' mental health problems have been explored previously in higher education literature in various contexts, including empirical work involving quantitative and qualitative methods. Nevertheless, comparatively little research aims at computational methods that learn information directly from data without relying on set parameters for a predetermined equation as an analytical method. This study aims to investigate the performance of machine learning (ML) models used in higher education. The ML models considered are Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Logistic Regression, Stochastic Gradient Descent, Decision Tree, Random Forest, XGBoost (Extreme Gradient Boosting Decision Tree), and NGBoost (Natural Gradient Boosting). Considering the factors of mental health illness among students, we follow three phases of data processing: segmentation, feature extraction, and classification. We evaluate these ML models against classification performance metrics such as accuracy, precision, recall, F1 score, and predicted run time. The empirical analysis includes two contributions: 1. It examines the performance of various ML models on a survey-based educational dataset, inferring a significant classification performance by a tree-based XGBoost algorithm; 2. It explores the feature importance [variables] from the datasets to infer the significant importance of social support, learning environment, and childhood adversities on a student’s mental health illness.
- Research Article
29
- 10.1097/tp.0000000000003640
- Nov 22, 2021
- Transplantation
Several groups have previously developed logistic regression models for predicting delayed graft function (DGF). In this study, we used an automated machine learning (ML) modeling pipeline to generate and optimize DGF prediction models en masse. Deceased donor renal transplants at our institution from 2010 to 2018 were included. Input data consisted of 21 donor features from United Network for Organ Sharing. A training set composed of ~50%/50% split in DGF-positive and DGF-negative cases was used to generate 400 869 models. Each model was based on 1 of 7 ML algorithms (gradient boosting machine, k-nearest neighbor, logistic regression, neural network, naive Bayes, random forest, support vector machine) with various combinations of feature sets and hyperparameter values. Performance of each model was based on a separate secondary test dataset and assessed by common statistical metrics. The best performing models were based on neural network algorithms, with the highest area under the receiver operating characteristic curve of 0.7595. This model used 10 out of the original 21 donor features, including age, height, weight, ethnicity, serum creatinine, blood urea nitrogen, hypertension history, donation after cardiac death status, cause of death, and cold ischemia time. With the same donor data, the highest area under the receiver operating characteristic curve for logistic regression models was 0.7484, using all donor features. Our automated en masse ML modeling approach was able to rapidly generate ML models for DGF prediction. The performance of the ML models was comparable with classic logistic regression models.
- Research Article
8
- 10.1016/j.egyai.2023.100308
- Oct 13, 2023
- Energy and AI
Generation of meaningful synthetic sensor data — Evaluated with a reliable transferability methodology
- Research Article
5
- 10.1038/s41598-025-15019-3
- Sep 29, 2025
- Scientific Reports
The challenges of handling imbalanced datasets in machine learning significantly affect the model performance and predictive accuracy. Classifiers tend to favor the majority class, leading to biased training and poor generalization of minority classes. Initially, the model incorrectly treats the target variable as an independent feature during data generation, resulting in suboptimal outcomes. To address this limitation, the model was adjusted to more effectively manage target variable generation and mitigate the issue. This study employed advanced techniques for synthetic data generation, such as the Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN), to enhance the representation of minority classes by generating synthetic samples. In addition, data augmentation strategies using Deep Conditional Tabular Generative Adversarial Networks (Deep-CTGANs) integrated with ResNet have been utilized to improve model robustness and overall generalizability. For classification, TabNet, a model tailored specifically for tabular data, proved highly effective with its sequential attention mechanism that dynamically processes features, making it well suited for handling complex and imbalanced datasets. Model performance was evaluated using a novel approach of training on synthetic data and testing on real data (TSTR). The framework was validated on the COVID-19, Kidney, and Dengue datasets, achieving impressive testing accuracies of 99.2%, 99.4%, and 99.5%, respectively. Furthermore, similarity scores of 84.25%, 87.35%, and 86.73% between the real and synthetic data for the COVID-19, Kidney, and Dengue datasets, respectively, confirmed the reliability of the synthetic data. TabNet consistently showed substantial improvements in F1-scores compared to other models, such as Random Forest, XGBoost, and KNN, emphasizing the importance of selecting the right synthetic data augmentation techniques and classifiers. 
Additionally, SHapley Additive exPlanations (SHAP)-based explainable AI tools were used to interpret model performance, providing insights into feature importance and its impact on predictions. These findings confirm that the proposed approach enhances accuracy, robustness, and interpretability, offering a valuable solution for addressing data imbalance in classification tasks.
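The core idea behind SMOTE-style oversampling, which the abstract names as one of its generators, is to create a new minority sample on the line segment between an existing minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration of that interpolation step only; it is not the study's pipeline nor a library implementation, and the function name, parameters, and data points are all assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        x = minority[i]
        # Find its k nearest neighbors among the other minority samples.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate: x_new = x + u * (neighbour - x), with u in [0, 1).
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


# Hypothetical minority-class feature vectors.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.3), (1.1, 1.4)]
new_points = smote_sketch(minority_points, n_new=6)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies; ADASYN differs mainly in generating more samples near points that are harder to classify.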
- Research Article
3
- 10.3389/frai.2025.1530397
- Mar 19, 2025
- Frontiers in artificial intelligence
AI fairness seeks to improve the transparency and explainability of AI systems by ensuring that their outcomes genuinely reflect the best interests of users. Data augmentation, which involves generating synthetic data from existing datasets, has gained significant attention as a solution to data scarcity. In particular, diffusion models have become a powerful technique for generating synthetic data, especially in fields like computer vision. This paper explores the potential of diffusion models to generate synthetic tabular data to improve AI fairness. The Tabular Denoising Diffusion Probabilistic Model (Tab-DDPM), a diffusion model adaptable to any tabular dataset and capable of handling various feature types, was utilized with different amounts of generated data for data augmentation. Additionally, sample reweighting from AIF360 was employed to further enhance AI fairness. Five traditional machine learning models, namely Decision Tree (DT), Gaussian Naive Bayes (GNB), K-Nearest Neighbors (KNN), Logistic Regression (LR), and Random Forest (RF), were used to validate the proposed approach. Experimental results demonstrate that the synthetic data generated by Tab-DDPM improves fairness in binary classification.
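Sample reweighting for fairness is commonly done with the reweighing scheme of Kamiran and Calders, where each sample receives the weight P(group) * P(label) / P(group, label). The sketch below implements that formula directly in plain Python; it is assumed, not confirmed by the abstract, that this is the scheme the authors used, and the code does not call the AIF360 API. The groups and labels are hypothetical.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight per sample: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


# Hypothetical data: group A is mostly labelled 1, group B mostly labelled 0.
groups = ["A"] * 8 + ["B"] * 8
labels = [1] * 6 + [0] * 2 + [1] * 2 + [0] * 6
weights = reweighing_weights(groups, labels)


def weighted_positive_rate(group):
    """Weighted share of positive labels within one group."""
    num = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    den = sum(w for g, w in zip(groups, weights) if g == group)
    return num / den


print(weighted_positive_rate("A"), weighted_positive_rate("B"))
```

After reweighting, both groups have the same weighted positive rate, so a classifier that honors sample weights no longer sees the label as correlated with group membership.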