To achieve content-consistent results in text-conditioned image editing, existing methods typically employ a reconstruction branch that captures source image details via diffusion inversion and a generation branch that synthesizes the target image from the given textual prompt and the masked source image details. However, the prevailing fixed-threshold mask strategy struggles to segment source details accurately, and shortcomings of the inversion process can lead to insufficient retention of those details. In this paper, we propose SAMControl (Soft Attention Mask), a method that adaptively controls pose and object details for image editing. SAMControl dynamically learns flexible attention masks for different images at different diffusion steps. Furthermore, in the reconstruction branch, we employ a direct inversion technique to preserve the fidelity of source details within the SAM. Extensive qualitative and quantitative results demonstrate the effectiveness of the proposed method.
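The abstract does not specify how the soft masks are parameterized, so the following is only a minimal sketch of the general idea it contrasts: replacing a single hard attention-map cutoff with a soft, per-diffusion-step relaxation whose parameters can be learned. All names here (SoftAttentionMask, threshold, temperature, fixed_threshold_mask) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SoftAttentionMask(nn.Module):
    """Hypothetical sketch: map a cross-attention map to a soft mask whose
    cutoff and sharpness are learned per diffusion step, rather than
    binarizing every image at every step with one fixed threshold."""

    def __init__(self, num_steps: int):
        super().__init__()
        # One learnable threshold and temperature per diffusion step (assumed).
        self.threshold = nn.Parameter(torch.full((num_steps,), 0.5))
        self.temperature = nn.Parameter(torch.full((num_steps,), 10.0))

    def forward(self, attn_map: torch.Tensor, step: int) -> torch.Tensor:
        # attn_map: (B, H, W) attention scores in [0, 1].
        # Sigmoid relaxation of thresholding: scores well above the learned
        # threshold approach 1, scores well below approach 0.
        return torch.sigmoid(
            self.temperature[step] * (attn_map - self.threshold[step])
        )


def fixed_threshold_mask(attn_map: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    # The baseline the abstract criticizes: one hard cutoff shared by all
    # images and all diffusion steps.
    return (attn_map > tau).float()


if __name__ == "__main__":
    soft = SoftAttentionMask(num_steps=50)
    attn = torch.rand(1, 64, 64)  # dummy cross-attention map
    print(fixed_threshold_mask(attn).mean().item(), soft(attn, step=10).mean().item())
```

Unlike the hard cutoff, the soft mask is differentiable, so its per-step parameters can be tuned end to end, which is one plausible way masks could adapt across images and diffusion steps as the abstract describes.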