Harnessing Stable Diffusion Model for High-Resolution Text-to-Image Synthesis
This study focuses on turning written descriptions into high-quality pictures using powerful AI diffusion models. These models use iterative denoising, which begins with a noisy image and gradually refines it to produce realistic and coherent outputs that match the user-provided text. Pre-trained models, such as Stable Diffusion, are used for their efficiency in text-to-image generation. Fine-tuning on specialized datasets improves adaptability, allowing the system to handle a wide range of textual inputs, from straightforward descriptions to complicated prompts. Techniques such as latent space processing maximize computing efficiency while maintaining output quality. With a reported accuracy of 90 percent, a U-Net architecture that incorporates attention mechanisms enhances the model’s capacity to generate detailed and accurate pictures. Index Terms: AI Diffusion Models, Text-to-Image Generation, Stable Diffusion, Latent Space Processing, U-Net Architecture, Attention Mechanisms, Image Synthesis, Generative AI, Fréchet Inception Distance (FID), Visual Content Creation
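The iterative denoising described in this abstract can be illustrated with a toy sketch of the standard DDPM-style forward process and the clean-signal estimate that a denoiser's noise prediction implies. This is not the paper's implementation: the linear schedule, its endpoint values, and all function names are assumptions made for illustration, and plain lists of floats stand in for the latent tensors that Stable Diffusion operates on.

```python
import math

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (illustrative values)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t); they shrink toward 0 as t grows."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def add_noise(x0, noise, alpha_bar_t):
    """Closed-form forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * noise."""
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + s * n for x, n in zip(x0, noise)]

def estimate_x0(xt, predicted_noise, alpha_bar_t):
    """Invert the forward process given a noise prediction.

    In a real model the prediction comes from a trained U-Net; here the
    caller supplies it directly (hypothetical stand-in).
    """
    a = math.sqrt(alpha_bar_t)
    s = math.sqrt(1.0 - alpha_bar_t)
    return [(x - s * n) / a for x, n in zip(xt, predicted_noise)]
```

With a perfect noise prediction, `estimate_x0` recovers the clean signal exactly. A sampler repeats a softened version of this step from the noisiest timestep down to the first, which is the "gradual refinement" the abstract refers to.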
- Research Article
7
- 10.9734/ajrcos/2024/v17i12533
- Dec 13, 2024
- Asian Journal of Research in Computer Science
Generative AI has emerged as a transformative field within artificial intelligence, enabling the creation of new data that mimics real-world information and expands the boundaries of what machines can autonomously generate. This study discusses the various models of generative AI, focusing on Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Auto-Regressive models, each offering distinct approaches and strengths in data generation. VAEs excel in learning latent representations, making them ideal for applications like anomaly detection and data imputation. GANs, renowned for their high-quality image synthesis, have found extensive use in tasks ranging from text-to-image conversion to super-resolution. Auto-Regressive models, on the other hand, are particularly effective in sequential data generation, such as text generation, music composition, and time series prediction. The paper highlights key applications of these models across diverse domains, including image synthesis, text generation, drug discovery, and simulation tasks in fields like healthcare, finance, and entertainment. Additionally, the study emphasizes the evaluation metrics (also called comparative parameters) that are crucial for assessing the performance of generative models, such as perceptual quality metrics, Inception Score (IS), and Fréchet Inception Distance (FID), which provide quantitative insights into the quality and diversity of generated data. This study employs a systematic methodology comprising a comprehensive literature review, strategic search queries, and thematic data synthesis to explore generative AI. Key areas of focus include models (VAE, GAN, auto-regressive, flow-based), applications, evaluation techniques, challenges, and recent advances. The analysis identifies emerging trends, novel methods, and critical gaps in the field.
This study also compares the performance of three generative AI models using comparative parameters such as Data Type, Applications, Training Complexity, Output Quality, Interpretability, Limitations, Advantages, Computational Cost and Scalability. Generative AI raises ethical concerns, including biases in training data that perpetuate stereotypes and marginalization. It can be misused for harmful purposes like creating deepfakes or spreading misinformation, impacting trust and privacy. Questions of accountability and ownership arise when AI-generated content infringes on intellectual property or causes harm. Addressing these issues is essential for responsible AI deployment.
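For readers unfamiliar with the FID metric named in this abstract, its usual definition is the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images. The symbols below are conventional notation and do not come from the abstract: $\mu_r, \Sigma_r$ are the mean and covariance of real-image features, and $\mu_g, \Sigma_g$ those of generated-image features.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer. The distance is zero when the fitted Gaussians coincide.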
- Research Article
2
- 10.1088/1361-6560/ad611a
- Jul 19, 2024
- Physics in Medicine & Biology
Objective. Head and neck radiotherapy planning requires electron densities from different tissues for dose calculation. Dose calculation from imaging modalities such as MRI remains an unsolved problem since this imaging modality does not provide information about the density of electrons. Approach. We propose a generative adversarial network (GAN) approach that synthesizes CT (sCT) images from T1-weighted MRI acquisitions in head and neck cancer patients. Our contribution is to exploit new features that are relevant for improving multimodal image synthesis, and thus improving the quality of the generated CT images. More precisely, we propose a dual-branch generator based on the U-Net architecture and on an augmented multi-planar branch. The augmented branch learns specific 3D dynamic features, which describe the dynamic image shape variations and are extracted from different view-points of the volumetric input MRI. The architecture of the proposed model relies on an end-to-end convolutional U-Net embedding network. Results. The proposed model achieves a mean absolute error (MAE) of 18.76±5.167 in the target Hounsfield unit (HU) space on sagittal head and neck patients, with a mean structural similarity (MSSIM) of 0.95±0.09 and a Fréchet inception distance (FID) of 145.60±8.38. The model yields an MAE of 26.83±8.27 when generating specific primary tumor regions on axial patient acquisitions, with a Dice score of 0.73±0.06 and an FID of 122.58±7.55. The improvement of our model over other state-of-the-art GAN approaches is 3.8% on a tumor test set. On both sagittal and axial acquisitions, the model yields the best peak signal-to-noise ratio of 27.89±2.22 and 26.08±2.95 to synthesize MRI from CT input. Significance. The proposed model synthesizes both sagittal and axial CT tumor images, used for radiotherapy treatment planning in head and neck cancer cases.
The performance analysis across different imaging metrics and under different evaluation strategies demonstrates the effectiveness of our dual CT synthesis model in producing high-quality sCT images compared to other state-of-the-art approaches. Our model could improve clinical tumor analysis, although further clinical validation remains to be explored.
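Two of the image-quality metrics quoted in this abstract, MAE and PSNR, are simple enough to sketch directly. The functions below are a minimal illustration over flat lists of intensity values. The function names and the `data_range` parameter (the assumed span of valid intensities) are illustrative choices, and the abstract does not state which range the authors used.

```python
import math

def mean_absolute_error(reference, synthetic):
    """Average absolute intensity difference (e.g. in Hounsfield units)."""
    if len(reference) != len(synthetic) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - s) for r, s in zip(reference, synthetic)) / len(reference)

def psnr(reference, synthetic, data_range):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    if len(reference) != len(synthetic) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - s) ** 2 for r, s in zip(reference, synthetic)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)
```

MAE is reported in the units of the image, while PSNR depends on the chosen `data_range`, so PSNR values are only comparable across studies that normalize intensities the same way.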
- Research Article
1
- 10.18287/2412-6179-co-1371
- Dec 1, 2024
- Computer Optics
This paper investigates the effect of the attention mechanism on the accuracy of hyperspectral image segmentation by convolutional neural networks in agriculture. The study compares two modifications of neural network architectures: with and without the attention mechanism. The attention mechanism is implemented as two modules: position-based (PAM) and channel-based (CAM). The positional module (PAM) considers the global context using information about the spatial domain of the whole image. The channel module (CAM), in turn, takes into account the information of all spectral components. L2Net and U-Net architectures are used for a comparative study. Modified versions with the addition of the attention mechanism are developed: L2AT-Net and ULAT-Net. The experimental results show that adding the attention mechanism to the U-Net and L2Net architectures increases the mean value of the F1 metric from 0.80 to 0.83 and from 0.74 to 0.78, respectively. The results show that the application of the attention mechanism can improve the quality of semantic segmentation of hyperspectral images.
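The idea behind a channel attention module (CAM), in which every channel is re-expressed as a weighted mix of all spectral channels, can be sketched as follows. This is a simplified illustration and not the paper's module: the dot-product similarity, the softmax weighting, the residual scale `gamma`, and the function names are all assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of similarity scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def channel_attention_weights(features):
    """features: C channels, each a flat list of N pixel values.

    Returns a C x C matrix whose row i holds the attention that
    channel i pays to every channel (each row sums to 1).
    """
    energy = [[sum(a * b for a, b in zip(ci, cj)) for cj in features]
              for ci in features]
    return [softmax(row) for row in energy]

def channel_attention(features, gamma=1.0):
    """Mix channels by attention and add the result back to the input."""
    weights = channel_attention_weights(features)
    num_channels, num_pixels = len(features), len(features[0])
    out = []
    for i in range(num_channels):
        mixed = [sum(weights[i][j] * features[j][p] for j in range(num_channels))
                 for p in range(num_pixels)]
        out.append([gamma * m + x for m, x in zip(mixed, features[i])])
    return out
```

A position attention module follows the same pattern with the roles swapped: similarities are computed between pixel positions rather than between channels, which is how it captures the global spatial context mentioned above.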
- Conference Article
8
- 10.1117/12.2512905
- Mar 13, 2019
- Medical Imaging 2019: Computer-Aided Diagnosis
Segmentation of the left atrium and proximal pulmonary veins is an important clinical step for diagnosis of atrial fibrillation. However, the automatic segmentation of the left atrium from late gadolinium-enhanced magnetic resonance (LGE-MRI) images remains a challenging task due to differences in acquisition and large variability between individuals. Deep learning has been shown to outperform traditional methodologies for segmentation in numerous tasks. A popular deep learning architecture for segmentation is the U-Net, which has shown promising results in biomedical segmentation problems. Many newer network architectures have been proposed that leverage the base U-Net architecture, such as attention U-Net, dense U-Net and residual U-Net. These models incorporate updated encoder blocks into the U-Net architecture to incrementally improve performance over the base U-Net. Currently, there is no comprehensive evaluation of performance between these models. In this study we (1) explore approaches for the segmentation of the left atrium based on different U-Net architectures, (2) compare and evaluate these on the STACOM 2018 Atrial Segmentation Challenge dataset, (3) ensemble these models to improve overall segmentation by reducing the internal variance between models and architectures, and (4) define and build upon a U-Net framework to simplify development of novel U-Net inspired architectures. Our ensemble achieves a mean Dice similarity coefficient (DSC) of 92.1 ± 2.0% on a test set of twenty 3D LGE-MRI images, outperforming other fully automatic segmentation methodologies.
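The Dice similarity coefficient reported here, and one common way to ensemble segmentation models, can be sketched in a few lines. The abstract does not say how the ensemble combines its members, so averaging per-pixel probabilities and thresholding at 0.5 is an assumption made purely for illustration, as are the function names.

```python
def dice_coefficient(pred, target):
    """DSC = 2 * |A intersect B| / (|A| + |B|) for two binary masks (flat lists)."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total

def ensemble_mask(prob_maps, threshold=0.5):
    """Average per-pixel foreground probabilities across models, then threshold.

    prob_maps: one flat list of probabilities per model (hypothetical inputs).
    """
    num_models = len(prob_maps)
    num_pixels = len(prob_maps[0])
    mean = [sum(m[i] for m in prob_maps) / num_models for i in range(num_pixels)]
    return [1 if p >= threshold else 0 for p in mean]
```

Averaging tends to cancel errors that individual models make independently, which is consistent with the variance-reduction motivation given in the abstract.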
- Research Article
6
- 10.3390/drones7030160
- Feb 25, 2023
- Drones
Benefiting from the development of unmanned aerial vehicles (UAVs), the types and number of datasets available for image synthesis have greatly increased. Based on such abundant datasets, many types of virtual scenes can be created and visualized using image synthesis technology before they are implemented in the real world, which can then be used in different applications. Building a convenient and fast image synthesis model runs into some common issues, such as blurred semantic information in the normalization layer and the use of only local spatial information from the feature map when generating images. To solve such problems, an improved image synthesis model, SYGAN, is proposed in this paper, which adds a spatially adaptive normalization module (SPADE) and a sparse attention mechanism (YLG) to a generative adversarial network (GAN). In SYGAN, the SPADE normalization module improves imaging quality by adjusting the normalization layer with spatially adaptive learned transformations, while the sparse attention mechanism YLG enlarges the receptive field of the model and has lower computational complexity, which saves training time. The experimental results show that, for natural scenes and street scenes respectively, SYGAN achieves a Fréchet Inception Distance (FID) of 22.1 and 31.2, a Mean Intersection over Union (MIoU) of 56.6 and 51.4, and a Pixel Accuracy (PA) of 86.1 and 81.3. Compared with other models such as CRN, SIMS, pix2pixHD and GauGAN, the proposed image synthesis model SYGAN has better performance and improves computational efficiency.
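The MIoU and PA figures quoted here are both derived from a per-class confusion matrix. The sketch below shows the usual computation on flat lists of class labels. The function names are illustrative, and averaging IoU only over classes that actually occur is one common convention that the abstract does not confirm.

```python
def confusion_matrix(pred, target, num_classes):
    """matrix[t][p] counts pixels with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        matrix[t][p] += 1
    return matrix

def pixel_accuracy(matrix):
    """Fraction of pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def mean_iou(matrix):
    """Mean over classes of TP / (TP + FP + FN), skipping absent classes."""
    ious = []
    for c in range(len(matrix)):
        tp = matrix[c][c]
        fn = sum(matrix[c]) - tp
        fp = sum(row[c] for row in matrix) - tp
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious)
```

Pixel accuracy can look high when one class dominates the image, whereas MIoU weights every class equally, which is why the two are typically reported together.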
- Research Article
4
- 10.1609/aaai.v39i20.35504
- Apr 11, 2025
- Proceedings of the AAAI Conference on Artificial Intelligence
Recently, the advent of generative AI technologies has had a transformational impact on our daily lives, yet their use in scientific applications remains in its early stages. Data scarcity is a major, well-known barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion of multiple data modalities to describe physical phenomena, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-modal pairwise data generation is required rather than the single-modal data generation that is usual for natural images (e.g., faces, scenery). Moreover, in real-world applications, an imbalance in the data available across modalities is common; for example, the spatial data (i.e., velocity maps) in seismic imaging can be easily simulated, but real-world seismic waveform data is largely lacking. While the most recent efforts enable the powerful diffusion model to generate multi-modal data, how to leverage the unbalanced available data is still unclear. In this work, we use seismic imaging in subsurface geophysics as a vehicle to present "UB-Diff", a novel diffusion model for multi-modal paired scientific data generation. One major innovation is a one-in-two-out encoder-decoder network structure, which can ensure pairwise data is obtained from a co-latent representation. The co-latent representation is then used by the diffusion process for pairwise data generation. Experimental results on the OpenFWI dataset show that UB-Diff significantly outperforms existing techniques in terms of Fréchet Inception Distance (FID) score and pairwise evaluation, indicating the generation of reliable and useful multi-modal pairwise data.
- Research Article
- 10.32628/ijsrst25126336
- Nov 23, 2025
- International Journal of Scientific Research in Science, Engineering and Technology
Generative AI is now a key way to create realistic fake images in areas like design, healthcare, and data improvement. There are lots of generative models out there, but how well they work depends on how they're built and trained. This study looks at three main types: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models. Each model was trained with common datasets and tested using standard measures like Inception Score (IS), Fréchet Inception Distance (FID), Structural Similarity Index (SSIM), and user ratings. The results show that diffusion models consistently produce the best images, capturing fine details and making sure the images look right. GANs can make sharp images quickly, but they can be unstable. VAEs learn steadily, but their images often aren't as sharp. This comparison points out the good and bad of each model, helping people pick the best one for their needs.
- Conference Article
4
- 10.1109/cict48419.2019.9066259
- Dec 1, 2019
Automatic analysis of histopathology specimen images can be utilized in early extraction and detection of diseases such as brain tumor, breast malignancy, colon cancer, etc. The early detection of cancer may allow patients to take proper treatment. In this paper, an automatic cell nuclei segmentation based on deep learning strategies using 2-D histological images is proposed. In the proposed approach, the U-Net architecture is used and its hyperparameters are tuned to segment the cell nuclei. The proposed solution is built upon the highly adaptive nature of the U-Net architecture. The task of nuclei segmentation in the proposed approach includes detection of nuclei in an image and extracting the foreground, while segmenting the connected foreground area into separated nuclei masks. In the experimental results, the proposed approach is tested using a dataset of histopathological cell images of breast cancer. The results show that the proposed deep learning based approach achieved 86 % average accuracy in segmentation of cell nuclei and also outperforms the other deep learning architectures.
- Book Chapter
4
- 10.1007/978-3-030-30493-5_47
- Jan 1, 2019
With the development of generative models, image synthesis conditioned on a specific variable has gradually become an important research theme. This paper presents a novel spectral normalization based Hybrid Attentional Generative Adversarial Network (HAGAN) for text-to-image synthesis. The hybrid attentional mechanism is composed of text-image cross-modal attention and self-attention over image sub-regions. The cross-modal attention mechanism helps synthesize more fine-grained and text-related images by introducing word-level semantic information into the generative model. The self-attention addresses the long-distance dependence among local-region image features during image generation. With spectral normalization, the training of GANs becomes more stable than that of traditional GANs, which helps avoid model collapse and gradient vanishing or explosion. We conduct experiments on the widely used Oxford-102 flower dataset and CUB bird dataset to validate our proposed method. In both quantitative and qualitative experimental comparisons, the results indicate that the proposed method achieves the best performance on Inception Score (IS), Fréchet Inception Distance (FID) and visual effect.
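Spectral normalization, which this abstract credits for more stable GAN training, divides a weight matrix by its largest singular value, usually estimated by power iteration. The sketch below shows that computation on a small list-of-lists matrix. The iteration count, the fixed starting vector, and the function names are illustrative assumptions and not details from the paper.

```python
import math

def _matvec(w, v):
    """Compute W v for a list-of-lists matrix."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def _matvec_t(w, u):
    """Compute W^T u without materializing the transpose."""
    return [sum(w[i][j] * u[i] for i in range(len(w))) for j in range(len(w[0]))]

def _normalize(x, eps=1e-12):
    """Scale to unit length; eps guards against division by zero."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / (norm + eps) for v in x]

def spectral_norm(w, iterations=50):
    """Estimate the largest singular value of w by power iteration.

    Starts from an all-ones vector; a start that is exactly orthogonal
    to the top singular direction would not converge to it.
    """
    u = _normalize([1.0] * len(w))
    v = _normalize([1.0] * len(w[0]))
    for _ in range(iterations):
        v = _normalize(_matvec_t(w, u))
        u = _normalize(_matvec(w, v))
    return sum(ui * wi for ui, wi in zip(u, _matvec(w, v)))

def spectrally_normalize(w, iterations=50):
    """Return w / sigma(w), whose largest singular value is about 1."""
    sigma = spectral_norm(w, iterations)
    return [[x / sigma for x in row] for row in w]
```

Training-time implementations typically run a single power-iteration step per update and carry `u` over between steps instead of iterating to convergence. The fully converged version here is only meant to make the quantity concrete.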
- Research Article
5
- 10.1016/j.fraope.2024.100182
- Nov 17, 2024
- Franklin Open
In this study, we propose a modified version of the widely used UNet architecture, enhanced by the integration of recurrent blocks at each step of the encoder (down-sampling) and decoder (up-sampling) stages. The proposed Recurrent UNet (R-UNet) architecture aims to improve the performance of semantic segmentation tasks by allowing the model to capture temporal dependencies and long-range contextual information. The R-UNet architecture consists of two main components: a recurrent encoder and a recurrent decoder. The recurrent encoder is composed of a series of convolutional and recurrent blocks, which extract features from the input image and propagate them across time. The recurrent decoder consists of a similar series of convolutional and recurrent blocks, which use the extracted features to generate the final segmentation mask. An attention mechanism is employed to enhance feature extraction at the bottleneck of the model. The proposed R-UNet architecture is evaluated on multiple benchmark datasets, including those for liver segmentation, brain tumor detection, mitochondria segmentation, lung imaging, a proprietary lung CT COVID-19 dataset, as well as various multi-organ imaging datasets. The experimental results demonstrate that the proposed R-UNet architecture outperforms the standard UNet architecture and several other state-of-the-art semantic segmentation models in terms of accuracy score, achieving an overall accuracy of 97.2 % on the Mitochondria dataset, 97.83 % on the Liver dataset, 89.17 % on the Tumor dataset and 97.22 % on the Lung dataset.
- Research Article
3
- 10.3390/diagnostics15081041
- Apr 19, 2025
- Diagnostics (Basel, Switzerland)
Background/Objectives: Multiple sclerosis (MS) is an autoimmune disease that damages the myelin sheath of the central nervous system, which includes the brain and spinal cord. Although MS lesions in the brain are more frequently investigated, MS lesions in the cervical spinal cord (CSC) can be much more specific for the diagnosis of the disease. Furthermore, as lesion burden in the CSC is directly related to disease progression, the presence of lesions in the CSC may help to differentiate MS from other neurological diseases. Methods: In this study, two novel deep learning models based on fractal architectures are proposed for the automatic detection and segmentation of MS lesions in the CSC by improving the convolutional and connection structures used in the layers of the U-Net architecture. In our previous study, we introduced the FractalSpiNet architecture by incorporating fractal convolutional block structures into the U-Net framework to develop a deeper network for segmenting MS lesions in the CSC. In this study, to improve the detection of smaller structures and finer details in the images, an attention mechanism is integrated into the FractalSpiNet architecture, resulting in the Att-FractalSpiNet model. In addition, in the second hybrid model, a fractal convolutional block is incorporated into the skip connection structure of the U-Net architecture, resulting in the development of the Con-FractalU-Net model. Results: Experimental studies were conducted using U-Net, FractalSpiNet, Con-FractalU-Net, and Att-FractalSpiNet architectures to detect the CSC region and the MS lesions within its boundaries. In segmenting the CSC region, the proposed Con-FractalU-Net architecture achieved the highest Dice Similarity Coefficient (DSC) score of 98.89%. Similarly, in detecting MS lesions within the CSC region, the Con-FractalU-Net model again achieved the best performance with a DSC score of 91.48%.
Conclusions: For segmentation of the CSC region and detection of MS lesions, the proposed fractal-based Con-FractalU-Net and Att-FractalSpiNet architectures achieved higher scores than the baseline U-Net architecture, particularly in segmenting small and complex structures.
- Research Article
13
- 10.1093/humrep/deae064
- Apr 10, 2024
- Human Reproduction (Oxford, England)
STUDY QUESTION: Can generative artificial intelligence (AI) models produce high-fidelity images of human blastocysts? SUMMARY ANSWER: Generative AI models exhibit the capability to generate high-fidelity human blastocyst images, thereby providing substantial training datasets crucial for the development of robust AI models. WHAT IS KNOWN ALREADY: The integration of AI into IVF procedures holds the potential to enhance objectivity and automate embryo selection for transfer. However, the effectiveness of AI is limited by data scarcity and ethical concerns related to patient data privacy. Generative adversarial networks (GAN) have emerged as a promising approach to alleviate data limitations by generating synthetic data that closely approximate real images. STUDY DESIGN, SIZE, DURATION: Blastocyst images were included as training data from a public dataset of time-lapse microscopy (TLM) videos (n = 136). A style-based GAN was fine-tuned as the generative model. PARTICIPANTS/MATERIALS, SETTING, METHODS: We curated a total of 972 blastocyst images as training data, where frames were captured within the time window of 110–120 h post-insemination at 1-h intervals from TLM videos. We configured the style-based GAN model with data augmentation (AUG) and pretrained weights (Pretrained-T: with translation equivariance; Pretrained-R: with translation and rotation equivariance) to compare their optimization on image synthesis. We then applied quantitative metrics including Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) to assess the quality and fidelity of the generated images. Subsequently, we evaluated qualitative performance by measuring the intelligence behavior of the model through the visual Turing test.
To this end, 60 individuals with diverse backgrounds and expertise in clinical embryology and IVF evaluated the quality of synthetic embryo images. MAIN RESULTS AND THE ROLE OF CHANCE: During the training process, we observed consistent improvement of image quality that was measured by FID and KID scores. Pretrained and AUG + Pretrained initiated with remarkably lower FID and KID values compared to both Baseline and AUG + Baseline models. Following 5000 training iterations, the AUG + Pretrained-R model showed the highest performance of the evaluated five configurations with FID and KID scores of 15.2 and 0.004, respectively. Subsequently, we carried out the visual Turing test, such that IVF embryologists, IVF laboratory technicians, and non-experts evaluated the synthetic blastocyst-stage embryo images and obtained similar performance in specificity with marginal differences in accuracy and sensitivity. LIMITATIONS, REASONS FOR CAUTION: In this study, we primarily focused the training data on blastocyst images as IVF embryos are primarily assessed in blastocyst stage. However, generation of an array of images in different preimplantation stages offers further insights into the development of preimplantation embryos and IVF success. In addition, we resized training images to a resolution of 256 × 256 pixels to moderate the computational costs of training the style-based GAN models. Further research is needed to involve a more extensive and diverse dataset from the formation of the zygote to the blastocyst stage, e.g. video generation, and the use of improved image resolution to facilitate the development of comprehensive AI algorithms and to produce higher-quality images. WIDER IMPLICATIONS OF THE FINDINGS: Generative AI models hold promising potential in generating high-fidelity human blastocyst images, which allows the development of robust AI models as it can provide sufficient training datasets while safeguarding patient data privacy.
Additionally, this may help to produce sufficient embryo imaging training data with different (rare) abnormal features, such as embryonic arrest, tripolar cell division to avoid class imbalances and reach to even datasets. Thus, generative models may offer a compelling opportunity to transform embryo selection procedures and substantially enhance IVF outcomes. STUDY FUNDING/COMPETING INTEREST(S): This study was supported by a Horizon 2020 innovation grant (ERIN, grant no. EU952516) and a Horizon Europe grant (NESTOR, grant no. 101120075) of the European Commission to A.S. and M.Z.E., the Estonian Research Council (grant no. PRG1076) to A.S., and the EVA (Erfelijkheid Voortplanting & Aanleg) specialty program (grant no. KP111513) of Maastricht University Medical Centre (MUMC+) to M.Z.E. TRIAL REGISTRATION NUMBER: Not applicable.
- Research Article
- 10.2174/0115734056401610250827114351
- Sep 1, 2025
- Current medical imaging
This study explored a generative image synthesis method based on diffusion models, potentially providing a low-cost and high-efficiency training data augmentation strategy for medical artificial intelligence (AI) applications. The MedMNIST v2 dataset was utilized as a small-volume training dataset under low-performance computing conditions. Based on the characteristics of existing samples, new medical images were synthesized using the proposed annotated diffusion model. In addition to observational assessment, quantitative evaluation was performed based on the gradient descent of the loss function during the generation process and the Fréchet Inception Distance (FID), using various loss functions and feature vector dimensions. Compared to the original data, the proposed diffusion model successfully generated medical images of similar styles but with dramatically varied anatomic details. The model trained with the Huber loss function achieved a higher FID of 15.2 at a feature vector dimension of 2048, compared with the model trained with the L2 loss function, which achieved the best FID of 0.85 at a feature vector dimension of 64. The use of the Huber loss enhanced model robustness, while FID values indicated acceptable similarity between generated and real images. Future work should explore the application of these models to more complex datasets and clinical scenarios. This study demonstrated that diffusion model-based medical image synthesis is potentially applicable as an augmentation strategy for AI, particularly in situations where access to real clinical data is limited. Optimal training parameters were also proposed by evaluating the dimensionality of feature vectors in FID calculations and the complexity of loss functions.
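The robustness difference between the two loss functions compared in this abstract comes from how each treats large residuals. The sketch below uses the standard definitions of mean squared (L2) loss and Huber loss. The threshold `delta=1.0` and the function names are illustrative assumptions, since the abstract does not give the settings used.

```python
def l2_loss(pred, target):
    """Mean squared error: grows quadratically with the residual."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def huber_loss(pred, target, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it.

    Per element: 0.5 * r^2 if |r| <= delta, else delta * (|r| - 0.5 * delta).
    """
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        if r <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(pred)
```

For a residual of 3 with `delta=1.0`, the L2 term is 9 while the Huber term is 2.5, so a few badly predicted pixels pull on the gradient far less under Huber loss.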
- Research Article
- 10.1088/2631-8695/ae15d4
- Oct 21, 2025
- Engineering Research Express
Spinal metastases are a frequent complication among cancer patients and can lead to severe neurological deficits. Accurate, efficient, and interpretable diagnosis is critical for timely treatment. Traditional radiological assessments are time-consuming, subjective, and limited by inter-observer variability. To address these challenges, this work presents TriDx, a novel unified framework integrating Generative Adversarial Networks (GANs), Convolutional Neural Networks (CNNs), and Generative AI (GenAI) for comprehensive diagnosis and report generation. Using the Spine-Mets CT Segmentation dataset from The Cancer Imaging Archive (TCIA), TriDx enhances image quality, learns fine-grained features, and produces radiology-like diagnostic reports. Our approach outperforms baseline models in segmentation accuracy, classification robustness, and diagnostic interpretability, paving the way for scalable AI-driven solutions in cancer diagnostics. TriDx achieves superior quantitative performance, with a Dice score of 0.894, a 95th-percentile Hausdorff distance (HD95) of 4.9 mm, and classification accuracy of 92.3%, significantly outperforming baseline methods such as standalone 3D U-Net (Dice 0.851) and GAN-enhanced U-Net (Dice 0.880). In generative assessment, TriDx reports a Fréchet Inception Distance (FID) of 14.2 and a BLEU score of 0.76 for report quality.
- Research Article
10
- 10.37547/tajet/volume06issue11-09
- Nov 22, 2024
- The American Journal of Engineering and Technology
This study investigates the effectiveness of generative models and traditional classification models in detecting fraud and anomalies within the retail banking sector. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) were evaluated for their capability to generate realistic synthetic transaction data and identify anomalies, achieving anomaly detection accuracies of 91.2% and 93.5%, respectively. These models were also assessed using Inception Score and Fréchet Inception Distance (FID), with GANs exhibiting superior data realism. Among classification models, Gradient Boosting Machines (GBM) demonstrated the best performance, achieving an accuracy of 96.3%, a precision of 93.5%, a recall of 91.4%, and an AUC-ROC of 97.2%. Random Forest and Logistic Regression also performed well, though with slightly lower metrics.
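The accuracy, precision, and recall figures quoted for the classifiers all follow from the four confusion counts of a binary fraud detector. The sketch below shows the standard definitions. The function name and the choice to return 0.0 when a denominator is empty are illustrative conventions and do not come from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from binary confusion counts.

    tp: frauds flagged as fraud      fp: legitimate flagged as fraud
    fn: frauds missed                tn: legitimate passed as legitimate
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}
```

Because fraud is rare, accuracy alone can be misleading: a model that never flags anything still scores high accuracy while its recall is zero, which is why the study reports precision, recall, and AUC-ROC alongside it.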