Adaptively Hierarchical Quantization Variational Autoencoder Based on Feature Decoupling and Semantic Consistency for Image Generation
The Vector Quantized Variational AutoEncoder (VQ-VAE) has shown great potential in image generation, especially the methods with hierarchical features. However, the lack of decoupling of structural information between hierarchical features leads to semantic inconsistencies and redundant structural features, resulting in incompatible outputs. In this study, we propose the Adaptively Hierarchical Quantization Variational AutoEncoder (AHQ-VAE) to generate high-fidelity images with a unified structure. To ensure the semantic consistency of the continuous space, we employ the Spatially Consistent Semantic Embedding (SCSE) module to align the hierarchical features while decoupling global structural information and local details. To ensure the consistency of the discrete space, we introduce the Adaptive Bottom Quantizer (ABQ) to generate quantized bottom codes consistent with the quantized top codes, so that the local details can adapt to the global semantics. Extensive experiments demonstrate that our approach can generate high-quality images with a unified structure.
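At the core of any VQ-VAE variant, including the AHQ-VAE described above, is the step that snaps each continuous latent vector to its nearest codebook entry. The sketch below illustrates that generic lookup in NumPy with toy values; it is not the paper's SCSE or ABQ module, and `vector_quantize` is a hypothetical helper name.

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each continuous latent vector in z to its nearest codebook entry.

    z:        (N, D) array of encoder outputs.
    codebook: (K, D) array of learned embedding vectors.
    Returns the quantized latents and the chosen code indices.
    """
    # Squared Euclidean distance from every latent to every codebook entry.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    indices = dists.argmin(axis=1)  # index of the nearest code per latent
    return codebook[indices], indices

# Toy example: four 2-D latents quantized against a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.9], [0.6, 0.6]])
zq, idx = vector_quantize(z, codebook)  # idx -> [0, 1, 2, 1]
```

In a full model the codebook is learned jointly with a commitment loss and a straight-through gradient estimator, and hierarchical variants run this lookup separately for the top and bottom code maps.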
- Research Article
2
- 10.1051/itmconf/20257002011
- Jan 1, 2025
- ITM Web of Conferences
In the realm of image creation, deep learning stands out as an effective and valuable machine learning technique. Deep learning can automatically learn the intrinsic features of images, achieving the goal of generating high-quality images by utilizing multi-layer neural network models. In recent years, deep learning-based image generation technology has made significant progress. This paper mainly introduces two main methods: the generative adversarial network (GAN) and the variational autoencoder (VAE). GANs have been widely used in image generation, image inpainting, and other tasks; VAEs perform well in image generation, image classification, and so on. However, current image generation technologies still face problems such as insufficient diversity and authenticity. Given these problems, this paper analyzes ways of improving and optimizing mainstream image generation algorithms from several perspectives: optimizing the loss function, improving latent-space modeling, revising the structures of the generator and discriminator, and speeding up the training process. Furthermore, the performance of these methods in image generation tasks is compared, and the strengths and weaknesses of each approach are evaluated. Image generation has emerged as a prominent research area in contemporary academia, with ample room for further exploration and practice.
- Research Article
1
- 10.3724/sp.j.1089.2022.19724
- Oct 1, 2022
- Journal of Computer-Aided Design & Computer Graphics
Generative adversarial networks are widely used in the field of image generation, but they tend to lose some image details during generation. In this paper, a detail-preserving image generation method based on semantic consistency is proposed to generate fine-grained images that contain more detailed features and to improve the semantic consistency between image and text. Firstly, in order to fully exploit the latent semantics in the text description, a feature extraction module is introduced to select important words and sentences and to extract semantic structure features between words and sentences. Secondly, a detail-preserving module combined with an attention mechanism is used to associate the image with the text information and effectively select the regions corresponding to the given text. Finally, semantic loss and perceptual loss are utilized to optimize image-text consistency at the word level and reduce the randomness of image generation. The experimental results show that the IS and FID scores reach 4.77 and 15.47 on the CUB dataset, and 35.56 and 27.63 on the COCO dataset, respectively.
- Book Chapter
64
- 10.1137/1.9781611975673.71
- May 6, 2019
Variational auto-encoder (VAE) is a powerful unsupervised learning framework for image generation. One drawback of VAE is that it generates blurry images due to its Gaussianity assumption and the resulting L2 loss. To allow the generation of high-quality images by VAE, we increase the capacity of the decoder network by employing residual blocks and skip connections, which also enable efficient optimization. To overcome the limitation of the L2 loss, we propose to generate images in a multi-stage manner, from coarse to fine. In the simplest case, the proposed multi-stage VAE divides the decoder into two components, in which the second component generates refined images based on the coarse images generated by the first component. Since the second component is independent of the VAE model, it can employ loss functions other than L2 as well as different model architectures. The proposed framework can easily be generalized to contain more than two components. Experimental results on the MNIST and CelebA datasets demonstrate that the proposed multi-stage VAE can generate sharper images than the original VAE.
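The coarse-to-fine scheme described above can be sketched as a two-component decoder, where a second stage refines an upsampled coarse output. The NumPy stand-in below uses random weights purely to show the interface; `coarse_decoder` and `refine` are hypothetical names, not the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def coarse_decoder(z, out_hw=7):
    """Stage 1 (stand-in): map a latent vector to a coarse low-res image."""
    W = rng.standard_normal((z.size, out_hw * out_hw)) * 0.1
    return np.tanh(z @ W).reshape(out_hw, out_hw)

def refine(coarse, scale=4):
    """Stage 2 (stand-in): upsample the coarse image and add a residual.

    In the multi-stage scheme this component is a separate network that may be
    trained with a non-L2 loss; here a nearest-neighbour upsample plus a random
    residual stands in for the learned refinement.
    """
    up = coarse.repeat(scale, axis=0).repeat(scale, axis=1)
    residual = 0.05 * rng.standard_normal(up.shape)  # placeholder for learned detail
    return np.clip(up + residual, -1.0, 1.0)

z = rng.standard_normal(16)   # latent sample drawn from the VAE prior
coarse = coarse_decoder(z)    # 7x7 coarse image
fine = refine(coarse)         # 28x28 refined image
```

The second stage here is deliberately trivial; the point is only the interface: a low-resolution coarse output feeding a refinement component that is free to use a different architecture and loss.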
- Research Article
- 10.1016/j.cmpb.2025.108909
- Sep 1, 2025
- Computer methods and programs in biomedicine
Fine-grained image generation with EEG multi-level semantics.
- Video Transcripts
- 10.48448/n98h-ah27
- Dec 29, 2020
- Underline Science Inc.
In recent years, sensors of various types have been deployed to observe the real world. In particular, weather sensors are densely installed all over the world to observe current weather conditions at various places. However, weather signals such as temperature or humidity obtained from weather sensors are not intuitive for humans to understand. On the other hand, images captured by typical RGB cameras can convey weather conditions at the captured places in a way that is more comprehensible to humans; however, cameras are installed only at limited places and are not necessarily open to the public due to privacy issues. To solve this problem, the goal of our work is to generate images that can convey weather conditions at arbitrary times and locations. This can be realized with a conditional generative adversarial network architecture that takes an image and a condition and transforms the image according to the condition. Training such an image generator requires a large number of image-condition pairs as training data. Although weather signals can easily be collected from weather sensors, collecting spatially and temporally synchronized outdoor images is not easy. Thus, we propose a semi-supervised method for training the image generator. A relatively small number of pairs of an outdoor image and weather signals are collected, each from different web services, by considering their semantic consistency. The collected pairs are used to train a predictor that predicts weather signals from a given outdoor image. Then, the image generator is trained using a large number of pairs of an outdoor image and pseudo weather signals produced by the predictor as training data.
- Research Article
- 10.54097/waybgz41
- Mar 13, 2024
- Highlights in Science, Engineering and Technology
Image generation has been a popular research task in the computer vision community, which aims to learn a distribution from a specific dataset and generate realistic images obeying this distribution. Thanks to the rapid development of deep learning technology, image generation models based on convolutional neural networks, especially generative adversarial networks (GANs) and variational autoencoders (VAEs), have become mainstream frameworks for image generation. However, in recent years, with the gradual deepening of research on the denoising diffusion probabilistic model (DDPM), image generation technology based on DDPM has made new breakthroughs in accuracy and speed. Centered on the diffusion model, this paper introduces its latest research progress in image generation and derivative tasks. Specifically, this paper reviews the key techniques and basic theories of the diffusion model in detail. Then, the main research work, improvement mechanisms, and characteristics of DDPM-based image generation methods are summarized. This paper focuses on the basic structure and related applications of diffusion models and evaluates some basic functions. Finally, the current problems and future development directions of image generation technology based on the diffusion model are analyzed and summarized.
- Research Article
20
- 10.1088/1742-6596/1525/1/012077
- Apr 1, 2020
- Journal of Physics: Conference Series
The need for large scale and high fidelity simulated samples for the ATLAS experiment motivates the development of new simulation techniques. Building on the recent success of deep learning algorithms at interpolation as well as image generation, Variational Auto-Encoders and Generative Adversarial Networks are investigated for modeling the response of the electromagnetic calorimeter for photons in a central calorimeter region over a range of energies. The synthesized showers are compared to showers from a full detector simulation using Geant4. This study demonstrates the potential of using such algorithms for fast calorimeter simulation for the ATLAS experiment in the future.
- Research Article
7
- 10.1088/1742-6596/2066/1/012008
- Nov 1, 2021
- Journal of Physics: Conference Series
The Variational Autoencoder (VAE), a kind of deep latent-space generative model, has achieved great success in recent years, especially in image generation. This paper aims to study image compression algorithms based on variational autoencoders. The experiments use an image quality evaluation model, since interpolation-based super-resolution is the most direct and simple way to change image resolution. In the experiment, the whole image is first transformed by the variational autoencoder, and actual coding is then applied to the resulting coefficients. Experimental data show that, after encoding with the improved variational-autoencoder-based method, the number of bits required for the symbol stream to be transmitted or stored is greatly reduced compared with traditional encoding, and symbol redundancy is effectively avoided. The experimental results show that the variational-autoencoder-based compression algorithm reduces the code length for image 1, image 2, and image 3 by 3332, 2637, and 1470 bits, respectively, compared with the traditional autoencoder-based algorithm. In future work, deep convolutional neural networks will be introduced to optimize the generative adversarial network, so that it can achieve better convergence speed and model stability.
- Research Article
35
- 10.1109/tmm.2021.3116416
- Jan 1, 2022
- IEEE Transactions on Multimedia
Text-to-Image (T2I) synthesis is a challenging task that aims to convert natural language descriptions to real images. It remains an open problem mainly due to the diversity of text descriptions, which poses a huge obstacle in generating vivid and relevant images. Moreover, the existing evaluation metrics in T2I synthesis are mainly used to evaluate the visual quality of the generated images, while the semantic consistency between the two modalities is often ignored. To address these issues, we present a novel Knowledge-Driven Generative Adversarial Network, termed KD-GAN, and a new evaluation system, named the Pseudo Turing Test (PTT for short). Concretely, KD-GAN takes a further step in imitating the behavior of human painting, i.e., drawing an image according to reference knowledge. The introduction of reference knowledge in KD-GAN not only improves the quality of the generated images but also enhances the semantic consistency between them and the input texts. In addition, KD-GAN can also greatly avoid some flaws against common sense during image generation, e.g., skiing in the blue sky. The proposed PTT is an important supplement to the existing evaluation system of T2I synthesis. It includes a set of pseudo-experts for different multimedia tasks to evaluate the semantic consistency between the given texts and the generated images.
To validate the proposed KD-GAN, we conducted extensive experiments on two benchmark datasets, i.e., Caltech-UCSD Birds (CUB) and MS-COCO (COCO). The experimental results demonstrate that KD-GAN outperforms state-of-the-art methods on IS, FID, and the proposed PTT metrics. The code for KD-GAN is available at https://github.com/pengjunn/KD-GAN, and the code and models for PTT are available at https://github.com/pengjunn/PTT.
- Research Article
3
- 10.1016/j.dsp.2023.104105
- Jun 1, 2023
- Digital Signal Processing
GMF-GAN: Gradual multi-granularity semantic fusion GAN for text-to-image synthesis
- Research Article
33
- 10.3390/s23073457
- Mar 25, 2023
- Sensors (Basel, Switzerland)
In recent decades, the Variational AutoEncoder (VAE) model has shown good potential and capability in image generation and dimensionality reduction. The combination of VAE and various machine learning frameworks has also worked effectively in different everyday applications; however, its possible use and effectiveness in modern game design has seldom been explored or assessed, and the use of its feature extractor for data clustering has also received minimal discussion in the literature. This study first explores different mathematical properties of the VAE model, in particular the theoretical framework of the encoding and decoding processes and the achievable lower bounds and loss functions of different applications; it then applies the established VAE model to generate new game levels based on two well-known game settings, and validates the effectiveness of its data clustering mechanism with the aid of the Modified National Institute of Standards and Technology (MNIST) database. Respective statistical metrics and assessments are also utilized to evaluate the performance of the proposed VAE model in the aforementioned case studies. Based on the statistical and graphical results, several potential deficiencies, for example difficulties in handling high-dimensional and vast datasets as well as insufficient clarity of outputs, are discussed; measures for future enhancement, such as tokenization and the combination of VAE and GAN models, are then outlined. Hopefully, this can ultimately maximize the strengths and advantages of VAE for future game design tasks and relevant industrial missions.
- Research Article
1
- 10.54097/hset.v39i.6561
- Apr 1, 2023
- Highlights in Science, Engineering and Technology
Image generation has always been a research hotspot in machine learning; it aims to build models that learn specific semantic distributions from massive image data in order to generate realistic simulated images. Thanks to the rapid development of deep learning technology, generative models are constantly being developed and have achieved huge success in image generation tasks. According to the differences between generative models, existing deep-learning-based image generation methods can mainly be separated into three categories: image generation based on the Variational Autoencoder (VAE), image generation based on the Generative Adversarial Network (GAN), and image generation combining the VAE and GAN. Focusing on these three frameworks, this paper describes the development process and underlying principles of each type of generative model. After that, the generation results of the different models on an agreed training set are compared intuitively, the advantages and problems of the various models are identified, and reasonable improvement measures are proposed for some of the problems. Finally, the development prospects of the various models are discussed.
- Research Article
1
- 10.1002/sdtp.16343
- Apr 1, 2023
- SID Symposium Digest of Technical Papers
Images, as a medium of visual information transmission, have the advantages of vividness, intuitiveness, and ease of understanding; they play an important role in the transmission and utilization of information. In recent years, owing to the rapid development of deep learning technology in the field of image processing, image generative models based on neural networks have become one of the current research hotspots. Within deep learning, unsupervised learning models have received more and more attention, especially deep generative models, which have made breakthrough progress [1]. Among them, the Variational Auto-Encoder (VAE), the Generative Adversarial Network (GAN), and the Diffusion Model are the three most representative research methods in the field of unsupervised learning, and they are increasingly applied in the field of deep generative models. In particular, high-quality image generative models based on generative adversarial networks remain a hot topic, while the diffusion model is a rising star favored by more and more researchers. This paper first summarizes the main research work, improvement mechanisms, and features of image generation methods based on VAE and GAN, then introduces the principle of the rising diffusion model and its representative models. Finally, the advantages and limitations of the above methods are compared and analyzed, and prospects for future research are put forward.
- Research Article
- 10.32473/flairs.38.1.139006
- May 14, 2025
- The International FLAIRS Conference Proceedings
Variational Autoencoders (VAEs) are popular Bayesian inference models that excel at approximating complex data distributions in a lower-dimensional latent space. Despite their widespread use, VAEs frequently face challenges in image generation, often resulting in blurry outputs. This outcome is primarily attributed to two factors: the inherent probabilistic nature of the VAE framework and the oversmoothing effect induced by the Kullback-Leibler (KL) divergence term in the loss function. This paper explores the integration of the Wasserstein distance into the VAE framework, resulting in Wasserstein Autoencoders (WAEs) designed to mitigate the oversmoothing issue and enhance the quality of generated images. We evaluated the proposed WAEs using the Fréchet Inception Distance (FID), Inception Score (IS), and Structural Similarity Index Measure (SSIM). The experimental results on the CelebA dataset demonstrate that WAEs significantly outperform VAEs by 25% in FID, 13.6% in IS, and 15.3% in SSIM. Additionally, the evaluation considers the issue of class imbalance in the ODIR dataset, where WAEs demonstrate superior accuracy and precision in classification tasks. Our findings highlight WAEs as a practical and efficient alternative to VAEs for image generation and reconstruction, particularly in resource-limited settings.
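FID, the main metric cited above, is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. Under a simplifying diagonal-covariance assumption the general formula reduces to a closed form; the sketch below computes only that reduced form on toy statistics and is not a full FID pipeline (which would require an Inception network and full covariance matrices).

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariance.

    General FID is ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}); when both
    covariances are diagonal, the trace term reduces to
    sum((sqrt(v1) - sqrt(v2))^2), computed elementwise below.
    """
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = ((mu1 - mu2) ** 2).sum()
    cov_term = ((np.sqrt(var1) - np.sqrt(var2)) ** 2).sum()
    return float(mean_term + cov_term)

# Identical distributions score 0; diverging means or variances increase FID.
same = fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1])      # -> 0.0
shifted = fid_diagonal([0, 0], [1, 1], [1, 0], [4, 1])   # -> 2.0
```

Lower is better: the second call pays 1.0 for the mean shift and 1.0 for the variance mismatch in the first coordinate.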
- Book Chapter
1
- 10.1007/978-3-030-61616-8_58
- Jan 1, 2020
Variational AutoEncoders (VAEs) are applied to many generation tasks but suffer from the posterior collapse issue. Vector Quantization (VQ) has recently been employed in VAE models for image generation; it avoids the posterior collapse problem and shows potential for further generation tasks. In this paper, the VQ method is applied to VAE for text generation. We elaborately design the model architecture to mitigate the index collapse issue introduced by the VQ process. Experiments show that our text generation model achieves better reconstruction and generation performance than other VAE-based approaches.