Synthetic neurosurgical data generation with generative adversarial networks and large language models: an investigation on fidelity, utility, and privacy.
Use of neurosurgical data for clinical research and machine learning (ML) model development is often limited by data availability, sample sizes, and regulatory constraints. Synthetic data offer a potential solution to challenges associated with accessing, sharing, and using real-world data (RWD). The aim of this study was to evaluate the capability of generating synthetic neurosurgical data with a generative adversarial network and large language model (LLM) to augment RWD, perform secondary analyses in place of RWD, and train an ML model to predict postoperative outcomes. Synthetic data were generated with a conditional tabular generative adversarial network (CTGAN) and the LLM GPT-4o based on a real-world neurosurgical dataset of 140 older adults who underwent neurosurgical interventions. Each model was used to generate datasets at equivalent (n = 140) and amplified (n = 1000) sample sizes. Data fidelity was evaluated by comparing univariate and bivariate statistics to the RWD. Privacy evaluation involved measuring the uniqueness of generated synthetic records. Utility was assessed by: 1) reproducing and extending clinical analyses on predictors of Karnofsky Performance Status (KPS) deterioration at discharge and a prolonged postoperative intensive care unit (ICU) stay, and 2) training a binary ML classifier on amplified synthetic datasets to predict KPS deterioration on RWD. Both the CTGAN and GPT-4o generated complete, high-fidelity synthetic tabular datasets. GPT-4o matched or exceeded CTGAN across all measured fidelity, utility, and privacy metrics. All significant clinical predictors of KPS deterioration and prolonged ICU stay were retained in the GPT-4o-generated synthetic data, with some differences observed in effect sizes. Preoperative KPS was not preserved as a significant predictor in the CTGAN-generated data. 
The ML classifier trained on GPT-4o data outperformed the model trained on CTGAN data, achieving a higher F1 score (0.725 vs 0.688) for predicting KPS deterioration. This study demonstrated a promising ability to produce high-fidelity synthetic neurosurgical data using generative models. Synthetic neurosurgical data present a potential solution to critical limitations in data availability for neurosurgical research. Further investigation is necessary to enhance synthetic data utility for secondary analyses and ML model training, and to evaluate synthetic data generation methods across other datasets, including clinical trial data.
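One simple way to read the "uniqueness" privacy check described above is as the share of synthetic records that do not exactly duplicate any real record. The sketch below assumes that definition. The study's actual metric, the column choice, and the toy rows are illustrative assumptions, not details taken from the paper.

```python
def uniqueness_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that do not exactly match any real row."""
    if not synthetic_rows:
        raise ValueError("synthetic_rows must not be empty")
    real_set = {tuple(row) for row in real_rows}
    unique = sum(1 for row in synthetic_rows if tuple(row) not in real_set)
    return unique / len(synthetic_rows)


# Hypothetical toy records: (age, preoperative KPS, ICU days)
real = [(71, 80, 1), (68, 90, 0), (75, 70, 3)]
synthetic = [(71, 80, 1), (70, 80, 2), (66, 90, 0), (74, 70, 3)]

# Only the first synthetic row duplicates a real record.
rate = uniqueness_rate(real, synthetic)
print(f"uniqueness rate: {rate:.2f}")
```

A rate near 1 means few verbatim copies of real patients. Exact matching is a coarse check: it would not catch near-duplicates, which a distance-based measure would be needed for.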
- Conference Article
2
- 10.54941/ahfe1005349
- Jan 1, 2024
- AHFE international
Collaborative robots, or cobots, are one of the Industry 4.0 technologies that have changed, and continue to change, many industrial procedures. However, amid this technological advancement, the persisting physical strain on human workers remains a significant concern. Even with the advent of cobots aimed at alleviating burdensome tasks, certain physical jobs continue to induce fatigue in human workers. Addressing this challenge necessitates the development of robust solutions that combine technological innovation with human-centric considerations. One critical aspect in mitigating physical fatigue in human workers involves the application of Machine Learning (ML) models. These models heavily depend on data obtained from real-world situations that accurately represent the complexities of physical strain. However, this kind of data is frequently limited and costly to gather using sensors, which hinders the development of an effective ML model. This scarcity underscores the need for alternative approaches, with Synthetic Data Generation (SDG) emerging as a viable solution to this problem. The production of synthetic data offers a new approach to address the lack of relevant data needed to train machine learning algorithms. By employing techniques like Tabular Generative Adversarial Networks (GANs), synthetic datasets can be created, simulating realistic human physical fatigue detection features. Tabular GANs have, for example, been shown to be effective in creating synthetic data that closely resembles the statistical characteristics and patterns of real-world datasets. Furthermore, tabular GANs present a scalable and affordable response to the problem of data scarcity. The research reported here presents a novel approach centred on employing the Tabular GAN methodology to create synthetic datasets encompassing key features pertinent to the detection of human physical fatigue.
The results of this study are expected to contribute substantially to creating robust solutions to alleviate physical strain and enhance human workers' overall well-being in industrial settings. The goal is to create datasets that accurately represent the complexities found in real-world scenarios where physical fatigue notably influences human performance. These synthetically generated datasets will serve as the foundation for training specialized ML models designed explicitly for detecting the development of human physical fatigue. The trained ML model will undergo rigorous testing and validation using a substantial repository of authentic real-world data. The model's accuracy and reliability in detecting human physical fatigue will be assessed through this evaluation process. The ultimate objective is to achieve a level of accuracy that demonstrates the model's proficiency in identifying and predicting the onset of physical fatigue in human workers within industrial settings. This research endeavours to bridge the gap between Industry 4.0 innovations and human well-being by leveraging synthetic data generation techniques to enhance the accuracy and efficiency of ML models in detecting human physical fatigue.
- Research Article
4
- 10.3390/machines13030235
- Mar 13, 2025
- Machines
There has been a growth of collaborative robots in Industry 5.0 due to the research in automation involving human-centric workplace design. It has had a substantial impact on industrial processes; however, physical exertion in human workers is still an issue, requiring solutions that combine technological innovation with human-centric development. By analysing real-world data, machine learning (ML) models can detect physical fatigue. However, sensor-based data collection is frequently used, which is often expensive and constrained. To overcome this gap, synthetic data generation (SDG) uses methods such as tabular generative adversarial networks (GANs) to produce statistically realistic datasets that improve machine learning model training while providing scalability and cost-effectiveness. This study presents an innovative approach utilising conditional GAN with auxiliary conditioning to generate synthetic datasets with essential features for detecting human physical fatigue in industrial scenarios. This approach allows us to enhance the SDG process by effectively handling the heterogeneous and imbalanced nature of human fatigue data, which includes tabular, categorical, and time-series data points. These generated datasets will be used to train specialised ML models, such as ensemble models, to learn from the original dataset from the extracted feature and then identify signs of physical fatigue. The trained ML model will undergo rigorous testing using authentic, real-world data to evaluate its sensitivity and specificity in recognising how closely generated data match with actual human physical fatigue within industrial settings. This research aims to provide researchers with an innovative method to tackle data-driven ML challenges of data scarcity and further enhance ML technology’s efficiency through training on SD. 
This study not only provides an approach to create complex realistic datasets but also helps in bridging the gap of Industry 5.0 data challenges for the purpose of innovations and worker well-being by improving detection capabilities.
- Research Article
18
- 10.3389/fbioe.2024.1350135
- Feb 14, 2024
- Frontiers in Bioengineering and Biotechnology
Objective: Biomechanical Machine Learning (ML) models, particularly deep-learning models, demonstrate the best performance when trained using extensive datasets. However, biomechanical data are frequently limited due to diverse challenges. Effective methods for augmenting data in developing ML models, specifically in the human posture domain, are scarce. Therefore, this study explored the feasibility of leveraging generative artificial intelligence (AI) to produce realistic synthetic posture data by utilizing three-dimensional posture data. Methods: Data were collected from 338 subjects through surface topography. A Variational Autoencoder (VAE) architecture was employed to generate and evaluate synthetic posture data, examining its distinguishability from real data by domain experts, ML classifiers, and Statistical Parametric Mapping (SPM). The benefits of incorporating augmented posture data into the learning process were exemplified by a deep autoencoder (AE) for automated feature representation. Results: Our findings highlight the challenge of differentiating synthetic data from real data for both experts and ML classifiers, underscoring the quality of synthetic data. This observation was also confirmed by SPM. By integrating synthetic data into AE training, the reconstruction error can be reduced compared to using only real data samples. Moreover, this study demonstrates the potential for reduced latent dimensions, while maintaining a reconstruction accuracy comparable to AEs trained exclusively on real data samples. Conclusion: This study emphasizes the prospects of harnessing generative AI to enhance ML tasks in the biomechanics domain.
- Research Article
9
- 10.1115/1.4062741
- Jul 14, 2023
- ASCE-ASME Journal of Risk and Uncertainty in Engineering Systems, Part B: Mechanical Engineering
Despite the pipeline network being the safest mode of oil and gas transportation, the pipeline failure rate has increased significantly over the last decade, particularly for aging pipelines. Predicting failure risk and prioritizing the riskiest assets from a large set of pipelines is one of the demanding tasks for utilities. Machine learning (ML) application in pipeline failure risk prediction has recently shown promising results. However, due to safety and security concerns, obtaining sufficient operation and failure data to train ML models accurately is a significant challenge. This study employed a Generative Adversarial Network (GAN) based framework to generate synthetic pipeline data (DSyn) using a subset (70%) of experimental burst test results data (DExp) compiled from the literature to overcome the limitation of accessing operational data. The proposed framework was tested on (1) real data, and (2) combined real and generated synthetic data. The burst failure risk of corroded oil and gas pipelines was determined using probabilistic approaches, and pipelines were classified into two classes depending on their probability of failure: (1) low failure risk (Pf: 0–0.5) and (2) high failure risk (Pf: >0.5). Two random forest (RF) models (MExp and MComb) were trained using a subset of 70% of actual experimental pipeline data (DExp) and a combination of 70% of actual experimental and 100% of synthetic data, respectively. These models were validated on the remaining subset (30%) of experimental test data. The validation results reveal that adding synthetic data can further improve the performance of the ML models. The area under the ROC curve was found to be 0.96 and 0.99 for the real-data model (MExp) and the combined-data model (MComb), respectively.
The combined model with improved performance can be used in strategic oil and gas pipeline resilience improvement planning, which sets long-term critical decisions regarding maintenance and potential replacement of pipes.
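The two-class labelling rule and the ROC-based comparison described above can be illustrated with a small sketch. It assumes the stated thresholds (Pf up to 0.5 is low risk, above 0.5 is high risk) and computes the area under the ROC curve by pairwise ranking. The Pf values and model scores are hypothetical, and the pure-Python AUC stands in for whatever tooling the authors used.

```python
def risk_class(pf):
    """Return 1 (high failure risk) if Pf > 0.5, else 0 (low failure risk)."""
    return 1 if pf > 0.5 else 0


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical probabilities of failure for six held-out pipelines
pf_values = [0.05, 0.20, 0.45, 0.55, 0.80, 0.95]
labels = [risk_class(pf) for pf in pf_values]

# Hypothetical classifier scores for the same pipelines
model_scores = [0.10, 0.30, 0.60, 0.40, 0.70, 0.90]
auc = roc_auc(labels, model_scores)
print(f"AUC = {auc:.3f}")
```

Running the same function on the scores of a real-only model and a real-plus-synthetic model, both evaluated on the held-out 30%, would give the kind of side-by-side comparison reported in the abstract.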
- Research Article
31
- 10.3389/fnins.2023.1219133
- Oct 2, 2023
- Frontiers in Neuroscience
Introduction: Major depressive disorder (MDD) is the most common mental disorder worldwide, leading to impairment in quality and independence of life. Electroencephalography (EEG) biomarkers processed with machine learning (ML) algorithms have been explored for objective diagnoses with promising results. However, the generalizability of those models, a prerequisite for clinical application, is restricted by small datasets. One approach to train ML models with good generalizability is complementing the original with synthetic data produced by generative algorithms. Another advantage of synthetic data is the possibility of publishing the data for other researchers without risking patient data privacy. Synthetic EEG time-series have not yet been generated for two clinical populations like MDD patients and healthy controls. Methods: We first reviewed 27 studies presenting EEG data augmentation with generative algorithms for classification tasks, like diagnosis, for the possibilities and shortcomings of recent methods. The subsequent empirical study generated EEG time-series based on two public datasets with 30/28 and 24/29 subjects (MDD/controls). To obtain baseline diagnostic accuracies, convolutional neural networks (CNN) were trained with time-series from each dataset. The data were synthesized with generative adversarial networks (GAN) consisting of CNNs. We evaluated the synthetic data qualitatively and quantitatively and finally used it for re-training the diagnostic model. Results: The reviewed studies improved their classification accuracies by between 1 and 40% with the synthetic data. Our own diagnostic accuracy improved up to 10% for one dataset but not significantly for the other. We found a rich repertoire of generative models in the reviewed literature, solving various technical issues. A major shortcoming in the field is the lack of meaningful evaluation metrics for synthetic data.
The few studies analyzing the data in the frequency domain, including our own, show that only some features can be produced truthfully. Discussion: The systematic review combined with our own investigation provides an overview of the available methods for generating EEG data for a classification task, their possibilities, and shortcomings. The approach is promising and the technical basis is set. For a broad application of these techniques in neuroscience research or clinical application, the methods need fine-tuning facilitated by domain expertise in (clinical) EEG research.
- Research Article
8
- 10.1016/j.egyai.2023.100308
- Oct 13, 2023
- Energy and AI
Generation of meaningful synthetic sensor data — Evaluated with a reliable transferability methodology
- Research Article
1
- 10.1093/ndt/gfad063c_5490
- Jun 14, 2023
- Nephrology Dialysis Transplantation
Background and Aims: Synthetic data can be an effective supplement or alternative to real data for the training of machine learning models. Synthetic data may also be used to evaluate new tools, develop educational curricula, or remove undesirable biases in datasets. We aim to evaluate four synthetic data generation methods applied to hypertension randomized clinical trial data. Method: The Systolic Blood Pressure Intervention Trial (SPRINT) showed that intensive BP control to SBP <120 mm Hg results in significant cardiovascular benefits in high-risk patients with hypertension compared with routine BP control to <140 mm Hg. The Synthetic Data Vault (SDV) is a synthetic data generation ecosystem of libraries that allows users to generate new synthetic data with the same format and statistical properties as the original dataset. SDV supports multiple types of data, including date-times, discrete-ordinal, categorical, and numerical. SPRINT data was pre-processed to create a single table of 140,000 patient visits with baseline variables (age, sex, race, aspirin use, estimated Glomerular Filtration Rate (eGFR)) and visit-level variables (systolic and diastolic blood pressure, heart rate, and total number of antihypertensive medications at end of visit). Using the SDV library for Python, we used four generative models to create synthetic SPRINT data: 1. Gaussian copula model, 2. Conditional Tabular Generative Adversarial Network (CTGAN), 3. CopulaGAN model, and 4. Tabular Variational Autoencoder (TVAE). We evaluated the results using the SDMetrics library, which assesses the shapes of the columns (marginal distributions), the pairwise trends between the columns (correlations), the reproduction of mathematical properties of the original data, and new row synthesis. Finally, an overall quality score, which represents an amalgamation of the marginal distribution and correlations, was computed, where 0 indicates the lowest quality and 1 indicates the highest.
Results: Two hundred thousand synthetic patient visits were created for each method. The overall quality scores in order were 90.67% for Gaussian copula, 86.77% for TVAE, 81.03% for CTGAN, and 79.7% for CopulaGAN. The column shape score, which represents the marginal distribution, was highest for Gaussian copula (94.54%), followed by TVAE (88.44%), CTGAN (82.35%), and CopulaGAN (80.27%). The column pair trend score, which corresponds to correlations, was highest for Gaussian copula (86.8%), followed by TVAE (85.1%), CTGAN (79.72%), and CopulaGAN (79.12%). Conclusion: Gaussian copula created the highest scoring synthetic SPRINT data based on the marginal distribution, correlations, and overall score. The Synthetic Data Vault is a feasible collection of methods for generation of synthetic clinical trial data for training future machine learning and AI models.
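The abstract describes the overall quality score as an amalgamation of the column shape and column pair trend scores. Assuming that amalgamation is a simple mean of the two (an assumption, though the reported figures are consistent with it to within rounding), the ranking can be reproduced from the component scores quoted above:

```python
# Component scores (percent) as reported in the abstract:
# (column shapes, column pair trends)
reported = {
    "Gaussian copula": (94.54, 86.80),
    "TVAE": (88.44, 85.10),
    "CTGAN": (82.35, 79.72),
    "CopulaGAN": (80.27, 79.12),
}

# Assumed aggregation: unweighted mean of the two component scores.
overall = {name: (shapes + trends) / 2 for name, (shapes, trends) in reported.items()}

# Models ordered from highest to lowest overall score.
ranking = sorted(overall, key=overall.get, reverse=True)

for name in ranking:
    print(f"{name}: {overall[name]:.3f}%")
```

The computed means agree with the reported overall scores (90.67, 86.77, 81.03, 79.7) to within 0.01 percentage points.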
- Research Article
- 10.1186/s12873-025-01468-6
- Mar 10, 2026
- BMC emergency medicine
Bronchiolitis remains a frequent reason for hospitalization in infants during the winter season. Epidemiologic surveillance remains crucial in the era of widespread immunoprophylaxis for the leading viral agent causing bronchiolitis. We investigated the performance of classical machine learning (ML) models, Deep Learning (DL), and a pre-trained large language model (LLM) in classifying bronchiolitis diagnosis from the free-text-diagnosis field of the emergency department electronic health records (EHRs). As a secondary aim, we evaluated the diagnostic accuracy of the actual official administrative ICD-9 encoding for Bronchiolitis diagnosis. 28,557 records of infants < 1 year with complete discharge diagnoses fields were retrieved between the years 2007–2018 and manually classified by an expert pediatrician to create the gold standard diagnosis set for training the algorithm. After data pre-processing, classical ML models (Random Forest, Decision Tree, Gradient Boosting Machine, Linear Discriminant Analysis, Support Vector Machine), a Deep Learning (DL) tool, and a pre-trained LLM (GPT-5) were evaluated using balanced accuracy, sensitivity, and F1 scores. The official administrative ICD-9 encoding classification accuracy was compared to the gold standard. Overall, 1,903 of 28,557 records (6.7%) were classified as bronchiolitis by the gold standard approach. The DL model and GPT-5 outperformed traditional ML models, achieving higher sensitivities (0.97, 95%CI 0.96-1.00, and 0.98, 95% CI 0.98–0.99, respectively), F1 scores (0.96, 95% CI 0.95–0.99, and 0.99, 95% CI 0.98–0.99, respectively), and balanced accuracy (0.98, 95%CI 0.98-1.00, and 0.99, 95% CI 0.99–0.99, respectively). Traditional ML models showed sensitivities between 0.77 and 0.98, F1 scores between 0.86 and 0.96, and balanced accuracies between 0.88 and 0.96. ICD-9 codes showed sensitivity of 85.9% (95% CI 84.27–87.45), and specificity of 98.5% (95% CI 98.36–98.65). 
To our knowledge, this is the first study directly comparing an LLM, deep learning, and multiple classical ML models for bronchiolitis surveillance in the post-Nirsevimab era. DL and GPT-5 outperformed traditional ML-based tools in identifying bronchiolitis diagnoses and ICD-9 diagnosis coding. AI-based tools hold significant potential for improving epidemiologic surveillance of bronchiolitis from emergency department EHRs.
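The evaluation metrics named above (sensitivity, specificity, F1 score, balanced accuracy) all derive from a binary confusion matrix. The sketch below shows the standard definitions. The counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts for a rare-positive problem (10% prevalence)
m = classification_metrics(tp=90, fp=20, fn=10, tn=880)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy is a sensible headline metric here because bronchiolitis accounts for only a small share of records, so plain accuracy would be dominated by the negative class.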
- Research Article
62
- 10.1016/j.iot.2024.101212
- May 7, 2024
- Internet of Things
As communication technology progresses, diverse datasets are exchanged across distributed environments using the Internet of Things (IoT). However, the IoT environment is vulnerable to attacks that breach data privacy or degrade an otherwise robust system by injecting attack data. To address potential risks of attacks, researchers have been conducting experiments on network intrusion detection systems (NIDS) to mitigate threats effectively. The issue of data imbalance and associated data collection costs persists, hindering the ability of machine learning (ML) models to learn malicious behaviour effectively and consequently impacting the accuracy of network threat detection. Addressing these issues, our study explores the potential of using 100% synthetic data generated via Generative Adversarial Networks (GAN) for training ML models in NIDS. This approach reduces the dependency on real-world data significantly, paving the way for a more flexible and ethically convenient model-building process. For the UNSW-NB15 dataset, we achieved an accuracy of 90%, a precision of 91%, a recall of 90%, and an F1 score of 89%. For the NSL-KDD dataset, our results showed an accuracy of 84%, a precision of 85%, a recall of 84%, and an F1 score of 84%. For the BoT-IoT dataset, we attained perfect scores of 100% across all metrics. These outcomes demonstrate high performance, yielding comparable or superior results to previous studies. Therefore, our study successfully replicates real-world network intrusion detection data, showing new opportunities for the use of generative data in cyber security.
- Research Article
75
- 10.1093/jamia/ocae103
- May 21, 2024
- Journal of the American Medical Informatics Association : JAMIA
Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, there are no studies for LLMs using real-world data and scenarios, in comparison to and being informed by traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits. We compared performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities. We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4 capabilities in many scenarios: through Zero-shot, Few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities. The Ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72 and an accuracy of 82.9%. The naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy). The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. 
Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.
- Research Article
10
- 10.1002/mrm.29970
- Dec 14, 2023
- Magnetic resonance in medicine
Machine learning (ML) has been increasingly used to quantify CEST effect. ML models are typically trained using either measured data or fully simulated data. However, training with measured data often lacks sufficient training data, whereas training with fully simulated data may introduce bias because of limited simulations pools. This study introduces a new platform that combines simulated and measured components to generate partially synthetic CEST data, and to evaluate its feasibility for training ML models to predict amide proton transfer (APT) effect. Partially synthetic CEST signals were created using an inverse summation of APT effects from simulations and the other components from measurements. Training data were generated by varying APT simulation parameters and applying scaling factors to adjust the measured components, achieving a balance between simulation flexibility and fidelity. First, tissue-mimicking CEST signals along with ground truth information were created using multiple-pool model simulations to validate this method. Second, an ML model was trained individually on partially synthetic data, in vivo data, and fully simulated data, to predict APT effect in rat brains bearing 9 L tumors. Experiments on tissue-mimicking data suggest that the ML method using the partially synthetic data is accurate in predicting APT. In vivo experiments suggest that our method provides more accurate and robust prediction than the training using in vivo data and fully synthetic data. Partially synthetic CEST data can address the challenges in conventional ML methods.
- Research Article
21
- 10.14778/3450980.3450989
- Mar 1, 2021
- Proceedings of the VLDB Endowment
Real-world data is dirty, which causes serious problems in (supervised) machine learning (ML). The widely used practice in such scenarios is to first repair the labeled source (a.k.a. train) data using rule-, statistical- or ML-based methods and then use the "repaired" source to train an ML model. During production, unlabeled target (a.k.a. test) data will also be repaired, and is then fed into the trained ML model for prediction. However, this process often causes a performance degradation when the source and target datasets are dirty with different noise patterns, which is common in practice. In this paper, we propose an adaptive data augmentation approach for handling missing data in supervised ML. The approach extracts noise patterns from target data, and adapts the source data with the extracted target noise patterns while still preserving supervision signals in the source. Then, it patches the ML model by retraining it on the adapted data, in order to better serve the target. To effectively support adaptive data augmentation, we propose a novel generative adversarial network (GAN) based framework, called DAGAN, which works in an unsupervised fashion. DAGAN consists of two connected GAN networks. The first GAN learns the noise pattern from the target, for target mask generation. The second GAN uses the learned target mask to augment the source data, for source data adaptation. The augmented source data is used to retrain the ML model. Extensive experiments show that our method significantly improves the ML model performance and is more robust than the state-of-the-art missing data imputation solutions for handling datasets with different missing value patterns.
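The core idea of adapting source data to the target's missing-value pattern can be shown with a deliberately simplified sketch. DAGAN learns and generates masks with GANs. The code below instead copies missingness masks directly from target rows as a stand-in, so it illustrates only the masking step, not the paper's method. All function names and data are hypothetical.

```python
def extract_mask(row):
    """Missingness mask for one row: True where the value is missing (None)."""
    return [value is None for value in row]


def apply_mask(row, mask):
    """Blank out the features of a source row wherever the mask marks missing."""
    return [None if missing else value for value, missing in zip(row, mask)]


# Hypothetical unlabeled target rows with missing values
target = [[1.0, None, 3.0], [None, 2.0, 5.0]]

# Hypothetical clean source rows; labels are kept separately and left untouched
source_features = [[0.5, 1.5, 2.5], [4.0, 6.0, 8.0]]
source_labels = [0, 1]

masks = [extract_mask(row) for row in target]
adapted = [apply_mask(row, mask) for row, mask in zip(source_features, masks)]
print(adapted)
```

Because only the features are masked, the supervision signal in `source_labels` is preserved, and a model retrained on `adapted` sees the same kind of gaps it will meet in the target data.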
- Research Article
1
- 10.2214/ajr.25.33759
- Oct 29, 2025
- AJR. American journal of roentgenology
BACKGROUND. Examination protocoling is a resource-intensive task. Various artificial intelligence (AI) approaches have been investigated to automate this process. OBJECTIVE. The purpose of this study was to evaluate performance of traditional machine learning (ML) models, bidirectional encoder representations from transformers (BERT) models, and large language models (LLMs) for automated CT and MRI protocoling. EVIDENCE ACQUISITION. MEDLINE, Embase, Scopus, Web of Science, IEEE Xplore, and Google Scholar databases were searched through July 2025 for studies reporting the performance of an AI-based technique in assigning protocols for CT or MRI requisitions. Accuracy results were separately extracted for all models tested in each study and pooled using a random-effects meta-analysis. AI approaches were compared using Welch t tests. Common sources of error were qualitatively summarized. EVIDENCE SYNTHESIS. The final analysis included 23 studies, comprising 1,196,259 imaging requisitions. Requisition subspecialties included body imaging (n = 4), musculoskeletal imaging (n = 3), neuroradiology (n = 6), thoracic imaging (n = 1), and multiple subspecialties (n = 9). Sixteen studies evaluated traditional ML models, eight evaluated BERT models, and five evaluated LLMs. Task-specific model fine-tuning was performed in three studies for traditional ML models, all studies for BERT models, and one study for LLMs. The overall pooled protocoling accuracy was 85% (95% CI, 83-87%). The pooled accuracy was 83% (95% CI, 80-85%) for traditional ML models, 87% (95% CI, 85-89%) for BERT models, and 86% (95% CI, 83-89%) for LLMs; these pooled accuracies were not significantly different between any pairwise combination of the three AI approaches (all p > .05). 
Among 30 distinct models (14 traditional ML models, nine BERT models, seven LLMs), the top-10 performing models comprised two traditional ML models, six BERT models (including the top performing model [BioBERT, a biomedical-domain BERT; accuracy, 93%]), and two LLMs. Common sources of error included ambiguous requisition text, data imbalance yielding incorrect protocol assignments for low-volume protocols, the presence of multiple clinically reasonable protocols for given requisitions, and difficulty handling requisitions containing terms strongly associated with disparate protocols. CONCLUSION. The top-performing AI models for automated CT and MRI protocoling included predominantly fine-tuned BERT models. CLINICAL IMPACT. AI tools show strong potential to help streamline radiologist workflows, possibly through hybrid AI-radiologist approaches. Fine-tuned LLMs warrant further exploration. TRIAL REGISTRATION. PROSPERO identifier CRD420251088671.
- Conference Article
6
- 10.1109/globecom46510.2021.9686011
- Dec 1, 2021
Minimization of Drive Test (MDT) reports are a key enabler for Machine Learning (ML)-based zero-touch automation envisioned for emerging cellular networks. However, due to numerous factors, the MDT reports are spatially sparse in nature. This sparsity undermines the performance of ML models that are built on the MDT data to estimate and optimize network KPIs. In this paper, we present and evaluate a framework to address this challenge. We leverage generative models, specifically Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE), to augment the sparse multi-dimensional MDT data. Unlike image data, where the quality of synthetic images produced by the generative models can be evaluated visually, establishing the authenticity of tabular synthetic data is a more complex problem. We address this problem by leveraging a tripartite approach: 1) We use several statistical measures to quantify the resemblance of synthetic data to original data. 2) We compare the performance of an ensemble learning model trained on augmented data with that of a model trained on original data only. 3) We benchmark the performance of the generative models against several classical ML models. This analysis is carried out for varying levels of sparsity and reveals insights about the robustness of generative models against training data sparsity as well as the suitability of various methods for evaluating the quality of the generated synthetic tabular data. Results show that GAN performs considerably better than the other approaches. The presented solution can thus be used to overcome the sparsity problem in MDT reports, thereby enabling ML-based automation use cases.
- Research Article
1
- 10.62487/yyx99243
- Jan 27, 2024
- Web3 Journal: ML in Health Science
Aim: The majority of machine learning (ML) models in healthcare are built on retrospective data, much of which is collected without explicit patient consent for use in artificial intelligence (AI) and ML applications. The primary aim of this study was to evaluate whether clinicians and scientific researchers themselves consent to provide their own data for the training of ML models. Materials and Methods: This survey was conducted through an anonymous online survey, utilizing platforms such as Telegram, LinkedIn, and Viber. The target audience comprised specific international groups, primarily Russian, German, and English-speaking, of clinicians and scientific researchers. These participants ranged in their levels of expertise and experience, from beginners to veterans. The survey centered on a singular, pivotal question: “Do You Consent to the Use of Your Biological and Private Data for Training Machine Learning and AI Models?” Respondents had the option to choose from three responses: “Yes” and “No”. Results: The survey was conducted in January 2024. A total of 119 unique and verified individuals participated in the survey. The results revealed that only 50% of respondents (63 persons) expressed consent to provide their own data for the training of ML and AI models. Conclusion: In the development of ML and AI models, particularly open-source ones, it is crucial to ascertain whether participants are willing to provide their private data. While ML algorithms can transform the nature of data, it is important to remember that the primary owner of this data is the individual. Our findings show that in 50% of the cases, even participants from scientific research and clinical backgrounds – individuals typically accountable for ensuring data quality in AI and ML model development – do not consent to the use of their data in AI and ML settings. 
This highlights the need for more stringent consent processes and ethical considerations in the utilization of personal data in AI and ML research.