Existing diffusion models outperform earlier generative models such as Generative Adversarial Networks (GANs) in image synthesis and editing. However, they still struggle to perform high-precision edits while preserving image details and faithfully following the editing instructions. To address these challenges, we propose a dual attention control method for high-precision image editing. Our approach comprises two key attention control modules: (1) a cross-attention control module, which blends the cross-attention maps of the original and edited images through weighted parameters, ensuring that the synthesized edited image retains the structure of the input image; and (2) a self-attention control module, which is applied at “coarse” or “fine” layers depending on the editing task, since the coarse layers help maintain input-image details while the fine layers are better suited to style transformations. Experimental evaluations demonstrate that our approach achieves excellent results in detail preservation, content consistency, visual realism, and semantic understanding, making it especially suitable for tasks requiring high-precision editing. Specifically, compared with editing under no attention control, introducing dual attention control increases the CLIP score by 6.19%, reduces LPIPS by 29.3%, and lowers FID by 24.7%. These improvements validate the effectiveness of dual attention control and attest to the method’s flexibility and adaptability across different scenarios. Notably, our approach is a zero-shot solution, requiring no user-side optimization or fine-tuning, which facilitates real-world applications.
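The following is a minimal PyTorch sketch of the dual attention control idea described above: weighted blending of cross-attention maps from the source and edited passes, and layer-dependent injection of source self-attention. All function names, the blending weight, and the choice of which layers count as “coarse” are illustrative assumptions, not the paper’s implementation.

```python
import torch

def blend_cross_attention(attn_src: torch.Tensor,
                          attn_edit: torch.Tensor,
                          weight: float = 0.7) -> torch.Tensor:
    """Combine cross-attention maps of the source and edited prompts.

    A higher `weight` keeps more of the source-image structure; the default
    value here is a hypothetical choice, not taken from the paper.
    """
    return weight * attn_src + (1.0 - weight) * attn_edit


def control_self_attention(self_attn_src: torch.Tensor,
                           self_attn_edit: torch.Tensor,
                           layer_idx: int,
                           coarse_layers: range = range(0, 8)) -> torch.Tensor:
    """Inject source self-attention only at the 'coarse' layers.

    Coarse layers (assumed here to be the earliest UNet blocks) preserve
    input details; fine layers are left to the edited pass so that style
    changes can take effect.
    """
    if layer_idx in coarse_layers:
        return self_attn_src   # keep input-image structure and detail
    return self_attn_edit      # allow stylistic change at fine layers


# Toy usage with random attention maps (batch x heads x queries x keys).
attn_src = torch.softmax(torch.randn(1, 8, 64, 77), dim=-1)
attn_edit = torch.softmax(torch.randn(1, 8, 64, 77), dim=-1)
blended = blend_cross_attention(attn_src, attn_edit, weight=0.6)
print(blended.shape)  # torch.Size([1, 8, 64, 77])
```

In practice such functions would be hooked into a diffusion UNet’s attention layers during the denoising loop; the toy usage above only demonstrates the tensor-level operations.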