CLIP-MedFake: Synthetic Data Augmentation With AI-Generated Content for Improved Medical Image Classification

Abstract

Data augmentation is a critical and fundamental technique for improving model generalization and performance across a wide spectrum of machine learning tasks. Despite growing interest in artificially generating new data to reduce overfitting during model training, enriching the diversity of training data in medicine still faces enormous challenges. Building on recent advances in generative artificial intelligence, we present a novel data augmentation framework, CLIP-MedFake, to address the shortage of training data for medical image classification. The proposed method first employs the Stable Diffusion model to generate synthetic data from a small amount of training data, then adopts a few-shot learning paradigm with the CLIP architecture as the backbone: the model is pre-trained on the synthetic data and fine-tuned on real medical images. Extensive experimental results on two publicly available datasets demonstrate the effectiveness of the proposed method in improving medical image classification.
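The two-stage regimen the abstract describes (pre-train on Stable Diffusion outputs, then fine-tune on real scans) can be sketched as a simple data schedule. The function and dataset names below are illustrative stand-ins, not the paper's actual implementation:

```python
import random

def two_stage_schedule(synthetic, real, pretrain_epochs=3, finetune_epochs=2, batch_size=2):
    """Yield (phase, batch) pairs: synthetic-only pre-training first,
    then fine-tuning on the real medical images."""
    for phase, data, epochs in (("pretrain", synthetic, pretrain_epochs),
                                ("finetune", real, finetune_epochs)):
        for _ in range(epochs):
            shuffled = data[:]
            random.shuffle(shuffled)
            for i in range(0, len(shuffled), batch_size):
                yield phase, shuffled[i:i + batch_size]

# Toy stand-ins: strings instead of Stable Diffusion images and real scans.
synthetic = [f"sd_image_{i}" for i in range(4)]
real = [f"ct_scan_{i}" for i in range(2)]
phases = [phase for phase, _ in two_stage_schedule(synthetic, real)]
```

In a real pipeline, each yielded batch would be encoded by the CLIP backbone and used for one gradient step; here strings stand in for images so the schedule itself is easy to inspect.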

Similar Papers
  • Research Article
  • Citations: 27
  • 10.1002/mp.15118
Classification of focal liver lesions in CT images using convolutional neural networks with lesion information augmented patches and synthetic data augmentation.
  • Aug 4, 2021
  • Medical physics
  • Hansang Lee + 5 more

We propose a deep learning method that classifies focal liver lesions (FLLs) into cysts, hemangiomas, and metastases from portal-phase abdominal CT images. We propose a synthetic data augmentation process to alleviate the class imbalance and the Lesion INformation Augmented (LINA) patch to improve learning efficiency. A dataset of 502 portal-phase CT scans with 1,290 FLLs was used. First, to alleviate the class imbalance and diversify the training data patterns, we suggest synthetic training data augmentation using DCGAN-based lesion mask synthesis and pix2pix-based mask-to-image translation. Second, to improve the learning efficiency of convolutional neural networks (CNNs) for small lesions, we propose a novel type of input patch, termed the LINA patch, that emphasizes lesion texture information while maintaining lesion boundary information. Third, we construct a multi-scale CNN through a model ensemble of ResNet-18 CNNs trained on LINA patches of various mini-patch sizes. The experiments demonstrate that (a) the synthetic data augmentation method shows characteristics different from but complementary to conventional real-data augmentation in augmenting data distributions, (b) the proposed LINA patches improve classification performance over existing types of CNN input patches thanks to the enhanced texture and boundary information in small lesions, and (c) an ensemble of LINA-patch-trained CNNs with different mini-patch sizes further improves overall classification performance. As a result, the proposed method achieved an accuracy of 87.30%, an improvement of 10.81%p and 15.0%p over the conventional image-patch-trained CNN and the texture-feature-trained SVM, respectively. The proposed synthetic data augmentation method shows promising results in improving data diversity and class imbalance, and the proposed LINA patches enhance learning efficiency compared with existing input image patches.
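The class-imbalance motivation above suggests a simple quota rule: generate enough synthetic samples per class to bring every class up to the size of the largest one. This heuristic is a sketch of the idea only, not the paper's DCGAN/pix2pix pipeline, and the class counts are made up:

```python
def synthetic_quota(class_counts):
    """Per-class number of synthetic samples needed so every class
    matches the largest class (illustrative balancing heuristic)."""
    target = max(class_counts.values())
    return {label: target - n for label, n in class_counts.items()}

# Hypothetical lesion-class counts, not from the paper's dataset.
counts = {"cyst": 520, "hemangioma": 310, "metastasis": 460}
quota = synthetic_quota(counts)
# quota -> {"cyst": 0, "hemangioma": 210, "metastasis": 60}
```

The generator (DCGAN, pix2pix, or a diffusion model) would then be asked for `quota[label]` images per class before training begins.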

  • Research Article
  • Citations: 8
  • 10.1148/ryai.230514
Addressing the Generalizability of AI in Radiology Using a Novel Data Augmentation Framework with Synthetic Patient Image Data: Proof-of-Concept and External Validation for Classification Tasks in Multiple Sclerosis.
  • Oct 16, 2024
  • Radiology. Artificial intelligence
  • Gianluca Brugnara + 14 more

Artificial intelligence (AI) models often face performance drops after deployment to external datasets. This study evaluated the potential of a novel data augmentation framework based on generative adversarial networks (GANs) that creates synthetic patient image data for model training to improve model generalizability. Model development and external testing were performed for a given classification task, namely the detection of new fluid-attenuated inversion recovery lesions at MRI during longitudinal follow-up of patients with multiple sclerosis (MS). An internal dataset of 669 patients with MS (n = 3083 examinations) was used to develop an attention-based network, trained both with and without the inclusion of the GAN-based synthetic data augmentation framework. External testing was performed on 134 patients with MS from a different institution, with MR images acquired using different scanners and protocols than images used during training. Models trained using synthetic data augmentation showed a significant performance improvement when applied on external data (area under the receiver operating characteristic curve [AUC], 83.6% without synthetic data vs 93.3% with synthetic data augmentation; P = .03), achieving comparable results to the internal test set (AUC, 95.0%; P = .53), whereas models without synthetic data augmentation demonstrated a performance drop upon external testing (AUC, 93.8% on internal dataset vs 83.6% on external data; P = .03). Data augmentation with synthetic patient data substantially improved performance of AI models on unseen MRI data and may be extended to other clinical conditions or tasks to mitigate domain shift, limit class imbalance, and enhance the robustness of AI applications in medical imaging. Keywords: Brain, Brain Stem, Multiple Sclerosis, Synthetic Data Augmentation, Generative Adversarial Network Supplemental material is available for this article. © RSNA, 2024.

  • Research Article
  • Citations: 64
  • 10.1109/lra.2021.3056355
Synthetic Biological Signals Machine-Generated by GPT-2 Improve the Classification of EEG and EMG Through Data Augmentation
  • Feb 4, 2021
  • IEEE Robotics and Automation Letters
  • Jordan J Bird + 4 more

Synthetic data augmentation is of paramount importance for machine learning classification, particularly for biological data, which tend to be high dimensional with a scarcity of training samples. Applications of robotic control and augmentation in disabled and able-bodied subjects still rely mainly on subject-specific analyses, which can rarely be generalised to the whole population and tend to overcomplicate simple action recognition such as grasp and release (standard actions in robotic prosthetics and manipulators). We show for the first time that multiple GPT-2 models can machine-generate synthetic biological signals (EMG and EEG) and improve real-data classification. Models trained solely on GPT-2-generated EEG data can classify a real EEG dataset at 74.71% accuracy, and models trained on GPT-2 EMG data can classify real EMG data at 78.24% accuracy. Synthetic and calibration data are then introduced within each cross-validation fold when benchmarking EEG and EMG models. Results show that algorithms improve when either or both additional data sources are used. A Random Forest achieves a mean 95.81% (1.46) classification accuracy on EEG data, which increases to 96.69% (1.12) when synthetic GPT-2 EEG signals are introduced during training. Similarly, the Random Forest classifying EMG data increases from 93.62% (0.8) to 93.9% (0.59) when the training data are augmented with synthetic EMG signals. Additionally, as predicted, augmentation with synthetic biological signals also increases classification accuracy for data from new subjects not observed during training. A Robotiq 2F-85 Gripper was finally used for real-time gesture-based control, with synthetic EMG data augmentation markedly improving gesture recognition accuracy from 68.29% to 89.5%.

  • Research Article
  • Citations: 1935
  • 10.1016/j.neucom.2018.09.013
GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification
  • Sep 21, 2018
  • Neurocomputing
  • Maayan Frid-Adar + 5 more


  • Book Chapter
  • Citations: 7
  • 10.1016/b978-0-443-19413-9.00026-6
Chapter 5 - Synthetic medical image augmentation: a GAN-based approach for melanoma skin lesion classification with deep learning
  • Jan 1, 2023
  • Deep Learning in Personalized Healthcare and Decision Support
  • V Nirmala + 1 more


  • Research Article
  • Citations: 14
  • 10.1038/s41598-022-22222-z
Automation of generative adversarial network-based synthetic data-augmentation for maximizing the diagnostic performance with paranasal imaging
  • Oct 27, 2022
  • Scientific Reports
  • Hyoun-Joong Kong + 9 more

Thus far, no specific rules have been reported for systematically determining the augmented sample size that optimizes model performance when conducting data augmentation. In this paper, we report on the feasibility of synthetic data augmentation using generative adversarial networks (GANs) by proposing an automation pipeline that finds the optimal multiple of data augmentation to achieve the best deep-learning-based diagnostic performance on a limited dataset. We used Waters’ view radiographs of patients diagnosed with chronic sinusitis to demonstrate the method developed herein. We demonstrate that our approach produces significantly better diagnostic performance parameters than models trained using conventional data augmentation. The deep learning method proposed in this study could be implemented to assist radiologists in improving their diagnoses. Researchers and industry practitioners could overcome the lack of training data by employing our proposed automation pipeline for GAN-based synthetic data augmentation, providing a new means of overcoming the shortage of image data for algorithm training.
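The automation pipeline described above, which searches for the augmentation multiple that maximizes diagnostic performance, amounts to a grid search over candidate multiples. In this sketch, `evaluate` is a hypothetical hook standing in for a full train-and-validate run, and the score curve is invented to show the typical saturate-then-degrade shape:

```python
def best_augmentation_multiple(evaluate, multiples=(0, 1, 2, 4, 8)):
    """Train/evaluate at each synthetic-augmentation multiple and
    return the multiple with the best validation score."""
    scores = {k: evaluate(k) for k in multiples}
    return max(scores, key=scores.get), scores

# Toy score curve: gains saturate, then over-augmentation hurts.
toy_scores = {0: 0.78, 1: 0.84, 2: 0.88, 4: 0.86, 8: 0.81}
best, scores = best_augmentation_multiple(toy_scores.get)
# best -> 2
```

A real pipeline would replace `toy_scores.get` with a function that generates `k` GAN samples per real image, retrains the classifier, and returns a validation metric such as AUC.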

  • Research Article
  • Citations: 60
  • 10.1016/j.ultras.2023.107041
A review of synthetic and augmented training data for machine learning in ultrasonic non-destructive evaluation
  • May 18, 2023
  • Ultrasonics
  • Sebastian Uhlig + 4 more

Ultrasonic Testing (UT) has seen increasing application of machine learning (ML) in recent years, promoting higher-level automation and decision-making in flaw detection and classification. Building a generalized training dataset to apply ML in non-destructive evaluation (NDE), and thus UT, is exceptionally difficult since data on pristine and representative flawed specimens are needed. Yet, in most UT test cases flawed specimen data is inherently rare making data coverage the leading problem when applying ML. Common data augmentation (DA) strategies offer limited solutions as they don’t increase the dataset variance, which can lead to overfitting of the training data. The virtual defect method and the recent application of generative adversarial neural networks (GANs) in UT are sophisticated DA methods targeting to solve this problem. On the other hand, well-established research in modeling ultrasonic wave propagations allows for the generation of synthetic UT training data. In this context, we present a first thematic review to summarize the progress of the last decades on synthetic and augmented UT training data in NDE. Additionally, an overview of methods for synthetic UT data generation and augmentation is presented. Among numerical methods such as finite element, finite difference, and elastodynamic finite integration methods, semi-analytical methods such as general point source synthesis, superposition of Gaussian beams, and the pencil method as well as other UT modeling software are presented and discussed. Likewise, existing DA methods for one- and multidimensional UT data, feature space augmentation, and GANs for augmentation are presented and discussed. The paper closes with an in-detail discussion of the advantages and limitations of existing methods for both synthetic UT training data generation and DA of UT data to aid the decision-making of the reader for the application to specific test cases.

  • Conference Article
  • 10.1145/3440084.3441178
Infrared Pedestrian Detection Based on GAN Data Augmentation
  • Nov 17, 2020
  • Jinda Hu + 2 more

Object detection, as an important branch of computer vision, has been widely studied in recent years. However, the lack of large labeled datasets obstructs the use of convolutional neural networks (CNNs) for detection in thermal infrared (TIR) images. Most existing datasets focus on visible images, while thermal infrared images are helpful for detection even in a dark environment. To address this problem, we propose to use image-to-image translation models, which allow us to translate the available labeled visible images into synthetic infrared images. Based on the original pedestrian dataset CVC-09, we use the pedestrian dataset CVC-14 to generate labeled pedestrian infrared images. Finally, we compare CNNs trained on the original dataset with classic data augmentation against those trained with additional synthetic data augmentation, and we explore the quality of the synthetic TIR images through contrast experiments. The average precision of detection using classic data augmentation alone is 79.18%; adding synthetic data augmentation improves it to 82.24%. We believe this method of synthetic data augmentation can be extended to other infrared detection applications and achieve further breakthroughs.

  • Book Chapter
  • Citations: 18
  • 10.1007/978-3-030-00320-3_4
Generation of Amyloid PET Images via Conditional Adversarial Training for Predicting Progression to Alzheimer’s Disease
  • Jan 1, 2018
  • Yu Yan + 3 more

New positron emission tomography (PET) tracers could have a substantial impact on early diagnosis of Alzheimer’s disease (AD) and mild cognitive impairment (MCI) progression, particularly if they are accompanied by optimised deep learning methods. To realize the full potential of deep learning for PET imaging, large datasets are required for training; however, dataset sizes are restricted due to limited availability. Meanwhile, most AD classification studies have been based on structural MRI rather than PET. In this paper, we propose a novel application of conditional Generative Adversarial Networks (cGANs) to the generation of 18F-florbetapir PET images from corresponding MRI images. Furthermore, we show that the generated PET images can be used for synthetic data augmentation and improve the performance of 3D convolutional neural networks (CNNs) for predicting progression to AD. Our method is applied to a dataset of 79 PET images obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. We generate high-quality PET images from corresponding MRIs using cGANs and evaluate their quality by comparison with real images. We then use the trained cGANs to generate synthetic PET images from an additional MRI dataset. Finally, we build a 152-layer ResNet to compare MCI classification performance using both the traditional data augmentation method and our proposed synthetic data augmentation method. The mean Structural Similarity (SSIM) index between generated and real PET images was 0.95 ± 0.05. For MCI progression classification, the traditional data augmentation method achieved 75% accuracy, while synthetic data augmentation improved this to 82%.

  • Research Article
  • 10.1177/18758967251385031
Optimizing Hate Speech Detection in Malayalam-English Code-Mixed Text: Handling Women's Abuse by Synthetic Data Augmentation
  • Dec 11, 2025
  • Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology
  • Dhanya Lk + 1 more

The rise of online hate speech, especially against women, has become a serious problem in digital communication, particularly in low-resource languages such as Malayalam and in settings where English is mixed with other languages. This work examines the efficacy of three synthetic data augmentation techniques, Machine Translation (MT), Masked Language Modeling (MLM), and Few-Shot Learning (FSL), in enhancing hate speech identification in Malayalam-English (Manglish) social media text. We apply these three methodologies to improve transformer-based models such as mBERT, BERT, and IndicBERT. Our experiments show substantial improvements in classification performance; for example, mBERT achieved an F1-score of 86.42% with augmentation, compared with 81.24% using real data alone. Explainability analysis with LIME indicates that contextual cues, not offensive words in isolation, drive accurate detection. Synthetic data also improves fairness by reducing false positives and false negatives, and improves generalization by exposing models to a wider range of code-mixed expressions. The approach is effective but has limitations: it may be difficult to transfer to other code-mixed languages or domains, and synthetic data generation raises ethical concerns. The results have practical implications for deploying fairness-aware, transparent, and resilient hate speech detection systems on multilingual social media platforms. To our knowledge, this is the first study to investigate synergistic synthetic data augmentation for detecting code-mixed hate speech, with the goal of reducing online harassment of women.

  • Research Article
  • 10.64898/2026.02.05.703825
SpliceRead: Improving Canonical and Non-Canonical Splice Site Prediction with Residual Blocks and Synthetic Data Augmentation.
  • Feb 9, 2026
  • bioRxiv : the preprint server for biology
  • Sahil Thapa + 3 more

Accurate splice site prediction is fundamental to understanding gene expression and its associated disorders. However, most existing models are biased toward frequent canonical sites, limiting their ability to detect rare but biologically important non-canonical variants. These models often rely heavily on large, imbalanced datasets that fail to capture the sequence diversity of non-canonical sites, leading to high false-negative rates. Here, we present SpliceRead, a novel deep learning model designed to improve the classification of both canonical and non-canonical splice sites using a combination of residual convolutional blocks and synthetic data augmentation. SpliceRead employs a data augmentation method to generate diverse non-canonical sequences and uses residual connections to enhance gradient flow and capture subtle genomic features. Trained and tested on a multi-species dataset of 400- and 600-nucleotide sequences, SpliceRead consistently outperforms state-of-the-art models across all key metrics, including F1-score, accuracy, precision, and recall. Notably, it achieves a substantially lower non-canonical misclassification rate than baseline methods. Extensive evaluations, including cross-validation, cross-species testing, and input-length generalization, confirm its robustness and adaptability. SpliceRead offers a powerful, generalizable framework for splice site prediction, particularly in challenging, low-frequency sequence scenarios, and paves the way for more accurate gene annotation in both model and non-model organisms. The open-source code of SpliceRead and detailed documentation are available at https://github.com/OluwadareLab/SpliceRead.
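The idea of generating diverse splice-site sequences can be illustrated with a minimal mutation-based augmenter that perturbs flanking positions while preserving the splice-site dinucleotide. This is an illustrative sketch only, not SpliceRead's actual augmentation method, and the donor sequence is a toy example:

```python
import random

def augment_splice_seq(seq, motif_start, motif_len=2, n=3, rate=0.1, seed=0):
    """Generate variants of a splice-site sequence by mutating positions
    outside the core dinucleotide motif (illustrative heuristic only)."""
    rng = random.Random(seed)
    bases = "ACGT"
    variants = []
    for _ in range(n):
        chars = list(seq)
        for i in range(len(chars)):
            if motif_start <= i < motif_start + motif_len:
                continue  # never touch the splice-site motif itself
            if rng.random() < rate:
                chars[i] = rng.choice([b for b in bases if b != chars[i]])
        variants.append("".join(chars))
    return variants

donor = "AAGTAAGTCC"  # toy donor-site sequence; "GT" motif at positions 2-3
out = augment_splice_seq(donor, motif_start=2)
```

Every variant keeps the "GT" dinucleotide intact while its flanks drift, mimicking how one might enrich the diversity of rare site classes without destroying the feature that defines them.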

  • Research Article
  • 10.1049/itr2.70104
Enriched Pedestrian Crossing Prediction Using Carla Synthetic Data
  • Jan 1, 2025
  • IET Intelligent Transport Systems
  • Mohsen Azarmi + 2 more

Pedestrian crossing prediction, which involves anticipating whether a pedestrian will cross the street or not, is a crucial function in autonomous driving systems. This is also a safety requirement for the interaction of highly automated vehicles and pedestrians. The endeavours in this research domain heavily rely on processing videos captured by the frontal cameras of autonomous vehicles using advanced computer vision techniques and deep learning methods. While recent studies focus on the model architecture for crossing prediction by utilising pre-trained visual feature extractors, they often encounter challenges stemming from inaccurate input features such as pedestrian body pose and/or scene semantic information. In this study, we aim to enhance pose estimation and semantic segmentation algorithms by using synthetic data augmentation (SDA) and domain randomisation (DR) techniques. SDA allows for automatic annotations through predefined agents and objects in a simulated urban environment. However, it creates a domain gap between synthetic and real-world data. To tackle this, we introduce a DR technique to generate synthetic data mimicking various weather and ambient illumination conditions. We evaluated two training strategies on six algorithms for both pose estimation and semantic segmentation, and ultimately we target four deep learning architectures for crossing prediction, including convolutional, recurrent, graph, and transformer neural networks. The proposed technique improves the extraction of pedestrian body pose and categorical semantic information, which in turn enhances the state-of-the-art. This results in effective feature selection as the input for the PIP task, improving prediction accuracy by 3.2%, 4.2%, and 6.3% to reach 87.6%, 92.2%, and 73.6% against the JAAD, PIE, and FU-PIP datasets, respectively. The study indicates that using a simulated environment with structural randomised properties can enhance the resilience of pedestrian crossing prediction to variations in the input data.

  • Research Article
  • Citations: 15
  • 10.1016/j.ibneur.2023.12.002
Evaluating synthetic neuroimaging data augmentation for automatic brain tumour segmentation with a deep fully-convolutional network
  • Dec 14, 2023
  • IBRO Neuroscience Reports
  • Fawad Asadi + 2 more


  • Research Article
  • 10.1080/09507116.2025.2539827
Deep learning based surface defect detection improvement through synthetic data augmentation with GANs
  • Aug 2, 2025
  • Welding International
  • Kumar Parmar + 1 more

This study presents an advanced approach for weld defect classification in radiographic images by integrating deep learning with data augmentation using Conditional Generative Adversarial Networks (cGANs). By addressing challenges associated with insufficiently annotated data, the proposed method significantly improves model generalizability and classification accuracy. The experimental results demonstrate a notable accuracy boost from 88.94% to 95.88% after augmentation, highlighting the impact of synthetic data generation. The stability of loss curves further validates the method’s effectiveness in minimizing discrepancies between real and synthetic data. Comparative analysis with state-of-the-art deep learning models, including ResNet-50, VGG16, and DenseNet-121, confirms its superior performance. This research underscores the potential of deep learning-driven synthetic data augmentation in enhancing weld defect detection, contributing to improved quality control in industrial welding operations.

  • Research Article
  • 10.65521/ijacect.v13i1.61
Deep Generative Models for Synthetic Data Generation and Augmentation
  • Mar 19, 2025
  • International Journal on Advanced Computer Engineering and Communication Technology
  • Ekaterina Katya + 1 more

The growing demand for large-scale, high-quality datasets in fields such as machine learning, artificial intelligence, and medical research has prompted the exploration of synthetic data generation techniques. Deep generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and normalizing flows, have shown great promise in generating realistic data across various domains. This paper provides an in-depth review of these models, highlighting their applications in synthetic data generation and augmentation. We discuss the principles, advancements, and challenges associated with deep generative models, including issues such as mode collapse, training instability, and the need for domain-specific adaptations. Furthermore, we explore the role of synthetic data in improving model robustness, enhancing privacy, and addressing data scarcity in sensitive areas like healthcare and autonomous driving. We conclude by outlining future directions for research, emphasizing the integration of generative models with other data augmentation techniques to further advance their applicability and efficiency.
