Abstract
Text-to-image generation is a technique for synthesizing images that correspond to textual descriptions. It underpins a wide range of applications and research areas, including photo editing, photo searching, art generation, computer-aided design, image reconstruction, captioning, and portrait drawing. With the development of text-to-image generation models, artificial intelligence (AI) has reached a turning point at which machines can translate human language into visually appealing and coherent images, opening new opportunities for creativity and innovation. Among the most notable advances in this field are stable diffusion models, which provide a strong framework for producing realistic images that are semantically aligned with the given textual descriptions. Despite their remarkable capabilities, however, conventional text-to-image models have serious shortcomings, particularly with respect to training time and computational cost: they typically require large amounts of processing power and long training runs, which makes them expensive and slow to train. The main goal of this work is to develop an improved Stable Diffusion model that overcomes these shortcomings while producing high-quality images from text. The proposed model substantially reduces training time and computational requirements without sacrificing the quality of the generated images. The results show that fine-tuning the Stable Diffusion model yields a considerable improvement in producing images that are closer to the originals: the fine-tuned model achieves a lower (better) FID score of 212.52, compared with 251.22 for the base model.
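As a rough illustration of the evaluation summarized above, the sketch below shows one common way to compare a base and a fine-tuned Stable Diffusion checkpoint by FID, using the Hugging Face diffusers pipeline and torchmetrics. It is not the authors' evaluation code; the model identifiers, prompts, and reference images are hypothetical placeholders.

```python
# Minimal sketch (assumed setup, not the authors' code): generate images with
# two Stable Diffusion checkpoints and compare them against reference images
# via the Frechet Inception Distance (FID), where lower is better.
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from torchmetrics.image.fid import FrechetInceptionDistance


def generate_images(model_id: str, prompts: list[str]) -> torch.Tensor:
    """Generate one image per prompt; return a uint8 tensor of shape (N, 3, H, W)."""
    pipe = StableDiffusionPipeline.from_pretrained(
        model_id, torch_dtype=torch.float16
    ).to("cuda")
    batches = []
    for prompt in prompts:
        pil_image = pipe(prompt).images[0]                 # PIL image
        array = np.asarray(pil_image)                      # (H, W, 3), uint8
        batches.append(torch.from_numpy(array).permute(2, 0, 1))
    return torch.stack(batches)


def fid_score(real: torch.Tensor, fake: torch.Tensor) -> float:
    """Compute FID between reference and generated images (both uint8 tensors)."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real, real=True)
    fid.update(fake, real=False)
    return float(fid.compute())


# Hypothetical usage: `reference_images` would be the ground-truth images paired
# with `prompts` in the evaluation set; the checkpoint paths are placeholders.
# base_fid  = fid_score(reference_images,
#                       generate_images("runwayml/stable-diffusion-v1-5", prompts))
# tuned_fid = fid_score(reference_images,
#                       generate_images("path/to/fine-tuned-checkpoint", prompts))
```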