Abstract

In the context of co-speech gesture generation, methods based on the Vector-Quantized Variational Autoencoder (VQ-VAE) have shown promising results by separating the generation process into two stages: first, discrete gesture priors are learned by pretraining on a gesture reconstruction task, which encodes gestures into a discrete codebook; then, the mapping between speech audio and gesture codebook indices is learned. This design leverages pretraining of the motion VQ-VAE on motion reconstruction to improve the quality of generated gestures. However, the vanilla VQ-VAE's codebook often fails to adequately encode both low-level and high-level gesture features, limiting reconstruction quality and generation performance. To address this, we propose the Hierarchical Discrete Audio-to-Gesture (HD-A2G) model, which (i) introduces a two-stage hierarchical codebook structure that captures high-level and low-level gesture priors, enabling the reconstruction of gesture details; (ii) fuses high-level and low-level features through an AdaIN layer, effectively enhancing the learning of gesture rhythm and content; and (iii) explicitly maps text and audio onset features to the corresponding levels of the codebook, ensuring that accurate hierarchical associations are learned for the generation stage. Experimental results on the BEAT and Trinity datasets demonstrate that HD-A2G outperforms baseline methods in both pretrained gesture reconstruction and audio-conditioned gesture generation by a clear margin, achieving state-of-the-art performance both qualitatively and quantitatively.
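To make the two-level codebook and AdaIN fusion described above concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation. It assumes a standard VQ-VAE straight-through quantizer, an arbitrary motion feature dimension, and illustrative module names (HierarchicalGestureVQ, VectorQuantizer, AdaIN) that do not appear in the paper; the actual HD-A2G architecture, losses, and hyperparameters may differ.

```python
# Hypothetical sketch of a two-level (high/low) gesture VQ-VAE with AdaIN fusion.
# Assumptions: standard straight-through VQ-VAE quantization, a coarse stream
# obtained by temporal downsampling, and illustrative dimensions/names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Standard VQ layer: nearest-codeword lookup + straight-through gradient."""
    def __init__(self, num_codes, dim, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                                   # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))
        dist = (flat.pow(2).sum(1, keepdim=True)            # squared L2 to each codeword
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=-1).view(z.shape[:-1])        # discrete code indices
        z_q = self.codebook(idx)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                        # straight-through estimator
        return z_q, idx, loss

class AdaIN(nn.Module):
    """AdaIN-style fusion: low-level content is re-styled by statistics
    predicted from the high-level (coarse) codes."""
    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, content, style):                      # content: (B, T, D), style: (B, T', D)
        mu = content.mean(dim=1, keepdim=True)
        sigma = content.std(dim=1, keepdim=True) + 1e-5
        normalized = (content - mu) / sigma                 # instance-normalize over time
        scale, shift = self.to_scale_shift(style.mean(dim=1, keepdim=True)).chunk(2, dim=-1)
        return (1 + scale) * normalized + shift

class HierarchicalGestureVQ(nn.Module):
    """Illustrative two-level codebook: a coarse (high-level) quantizer on a
    downsampled stream and a fine (low-level) quantizer on the full-rate
    stream, fused via AdaIN before decoding the motion."""
    def __init__(self, motion_dim=165, dim=256, num_codes=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(motion_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.down = nn.Conv1d(dim, dim, kernel_size=4, stride=4)   # coarse temporal stream
        self.vq_high = VectorQuantizer(num_codes, dim)
        self.vq_low = VectorQuantizer(num_codes, dim)
        self.fuse = AdaIN(dim)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, motion_dim))

    def forward(self, motion):                              # motion: (batch, time, motion_dim)
        h = self.encoder(motion)
        h_coarse = self.down(h.transpose(1, 2)).transpose(1, 2)
        zq_high, idx_high, loss_high = self.vq_high(h_coarse)   # high-level gesture priors
        zq_low, idx_low, loss_low = self.vq_low(h)               # low-level gesture details
        fused = self.fuse(zq_low, zq_high)                       # inject high-level style into low-level codes
        recon = self.decoder(fused)
        return recon, (idx_high, idx_low), loss_high + loss_low
```

In the generation stage, speech features would then be mapped to the two index streams (idx_high, idx_low); the abstract indicates that text and audio onset features are routed to the appropriate codebook levels, but that mapping network is not sketched here.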
