Abstract
Clothed human modeling plays a crucial role in multimedia research, with applications spanning virtual reality, gaming, and fashion design. The goal is to learn clothed human dynamics from observations and then generate humans with high-fidelity clothing details for motion animation. Despite tremendous advancements in clothing shape analysis by existing approaches, the community still faces challenges in generating convincing visual effects of cloth dynamics, maintaining temporally smooth clothing details, and handling diverse clothing patterns. To address these challenges, we introduce ClothDiffuse, a temporal diffusion model that seamlessly integrates three key components into this task: temporal dynamics modeling, iterative refinement, and diversified generation. Our approach begins by using an encoder to extract high-level temporal features from input human body motions. These features are combined with a learnable pixel-aligned garment feature, serving as prior conditions for the shape decoder. The decoder then iteratively denoises Gaussian noise to produce clothing deformations over time on the input unclothed human bodies. To ensure that the results align with observations and adhere to physical plausibility for clothing shape inference, we propose two physics-inspired loss functions that preserve the intra-frame distances and inter-frame forces of clothing points. Additionally, the stochastic nature of the denoising process allows for the generation of diverse and plausible clothing shapes. Experiments show that our approach outperforms state-of-the-art methods in Chamfer distance and visual effects, particularly for loose clothing such as dresses and skirts. Furthermore, our approach effectively adapts to out-of-domain clothing types and generates realistic cloth dynamics.
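To make the two physics-inspired loss terms concrete, the following is a minimal sketch of one possible formulation, assuming predicted and reference clothing points are given as sequences of shape [T, N, 3] and that garment connectivity is available as edge index pairs. The function names, tensor shapes, and the use of second-order temporal differences as a proxy for per-point forces are illustrative assumptions, not the authors' exact definitions.

```python
import torch

def intra_frame_distance_loss(pred_points, gt_points, edges):
    # Sketch: preserve pairwise distances between connected clothing points
    # within each frame. pred_points, gt_points: [T, N, 3]; edges: [E, 2].
    pred_d = (pred_points[:, edges[:, 0]] - pred_points[:, edges[:, 1]]).norm(dim=-1)
    gt_d = (gt_points[:, edges[:, 0]] - gt_points[:, edges[:, 1]]).norm(dim=-1)
    return (pred_d - gt_d).abs().mean()

def inter_frame_force_loss(pred_points, gt_points):
    # Sketch: match second-order temporal differences across consecutive
    # frames, used here as a stand-in for inter-frame forces on each point.
    pred_acc = pred_points[2:] - 2 * pred_points[1:-1] + pred_points[:-2]
    gt_acc = gt_points[2:] - 2 * gt_points[1:-1] + gt_points[:-2]
    return (pred_acc - gt_acc).abs().mean()
```

In this reading, the first term regularizes the spatial structure of the garment in every frame, while the second encourages temporally smooth deformations by penalizing spurious accelerations of clothing points across frames.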