Affordance-aware human insertion is a controllable human synthesis task that aims to seamlessly integrate a person into a scene, aligning the human pose with the scene's contextual affordances while preserving the person's visual identity. Previous methods, which typically rely on a general inpainting framework that injects all conditional information into a single branch, often struggle with the complexities of real-world contexts and the nuanced attributes of human figures. To this end, we present DISA, a novel DISentangled dual-branch framework for the Affordance-aware human insertion task, which focuses on both scene context comprehension and precise person attribute extraction. Specifically, our dual-branch design enables diffusion models to perform disentangled and precise manipulations: one branch employs an additional network for deep scene context comprehension and control, while the other uses a parallel encoder to extract features of the reference person and injects this information through a cross-attention mechanism. Furthermore, to comprehensively evaluate the affordance-aware human insertion task, we introduce a new metric that assesses the preservation of visual identity. We conduct a broad range of evaluation experiments and validate the diversity and robustness of our method across different settings and downstream applications. Both qualitative and quantitative analyses demonstrate that our approach outperforms previous methods in terms of image quality, pose accuracy, and visual identity preservation.
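To make the dual-branch design concrete, below is a minimal PyTorch-style sketch of how the two conditions could be kept disentangled: an auxiliary scene network contributes additive residual features, while a parallel person encoder contributes tokens consumed by cross-attention. All module and variable names (SceneBranch, PersonEncoder, DualBranchBlock) are hypothetical illustrations based on the abstract's description, not the authors' released implementation.

```python
# Hedged sketch of the disentangled dual-branch idea from the abstract.
# All module names are hypothetical; shapes are toy-sized for illustration.
import torch
import torch.nn as nn

class SceneBranch(nn.Module):
    """Additional network that digests the scene latent and returns
    residual features added to the denoiser's hidden states."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, scene_latent: torch.Tensor) -> torch.Tensor:
        return self.net(scene_latent)

class PersonEncoder(nn.Module):
    """Parallel encoder that turns the reference-person image into a
    token sequence consumed by cross-attention in the denoiser."""
    def __init__(self, channels: int, dim: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, dim, 4, stride=4)
        self.pool = nn.AdaptiveAvgPool2d(4)  # 4x4 grid -> 16 tokens

    def forward(self, person: torch.Tensor) -> torch.Tensor:
        feat = self.pool(self.proj(person))     # (B, dim, 4, 4)
        return feat.flatten(2).transpose(1, 2)  # (B, 16, dim)

class DualBranchBlock(nn.Module):
    """One denoiser block: scene features enter additively, person tokens
    enter via cross-attention, so the two conditions stay disentangled."""
    def __init__(self, channels: int, dim: int, heads: int = 4):
        super().__init__()
        self.to_q = nn.Conv2d(channels, dim, 1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_out = nn.Conv2d(dim, channels, 1)

    def forward(self, h, scene_res, person_tokens):
        h = h + scene_res                            # scene-context control
        b, _, H, W = h.shape
        q = self.to_q(h).flatten(2).transpose(1, 2)  # (B, H*W, dim)
        attn, _ = self.attn(q, person_tokens, person_tokens)
        attn = attn.transpose(1, 2).reshape(b, -1, H, W)
        return h + self.to_out(attn)                 # identity injection

if __name__ == "__main__":
    B, C, D = 2, 8, 32
    h = torch.randn(B, C, 16, 16)       # denoiser hidden state
    scene = torch.randn(B, C, 16, 16)   # masked-scene latent
    person = torch.randn(B, C, 64, 64)  # reference-person image features
    block = DualBranchBlock(C, D)
    out = block(h, SceneBranch(C)(scene), PersonEncoder(C, D)(person))
    print(out.shape)  # torch.Size([2, 8, 16, 16])
```

The key design choice this sketch illustrates is the separation of injection pathways: spatial scene control arrives as an additive residual, whereas person identity arrives through attention over a small token set, so neither condition overwrites the other in a single shared branch.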