Abstract

In the context of Co-Speech Gesture Generation, Vector-Quantized Variational Autoencoder (VQ-VAE)-based methods have shown promising results by splitting the generation process into two stages: first, learning discrete gesture priors by pretraining on a gesture reconstruction task, which encodes gestures into a discrete codebook; second, learning the mapping from speech audio to gesture codebook indices. This design leverages the motion VQ-VAE pretrained on motion reconstruction to improve the quality of generated gestures. However, the vanilla VQ-VAE's codebook often fails to adequately encode both low-level and high-level gesture features, resulting in limited reconstruction quality and generation performance. To address this, we propose Hierarchical Discrete Audio-to-Gesture (HD-A2G), which (i) introduces a two-stage hierarchical codebook structure that captures high-level and low-level gesture priors, enabling the reconstruction of gesture details; (ii) integrates high-level and low-level features using an AdaIN layer, effectively enhancing the learning of gesture rhythm and content; and (iii) explicitly maps text and audio onset features to the appropriate levels of the codebook, ensuring accurate hierarchical associations are learned for the generation stage. Experimental results on the BEAT and Trinity datasets demonstrate that HD-A2G outperforms the baseline methods in both pretrained gesture reconstruction and audio-conditioned gesture generation by a clear margin, achieving state-of-the-art performance both qualitatively and quantitatively.
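
To make the two-level codebook idea concrete, the following is a minimal PyTorch sketch of a hierarchical gesture VQ-VAE with AdaIN fusion of coarse and fine latents. All module names, dimensions (e.g. pose_dim=141, a 4x temporal downsampling for the high-level branch), and the exact placement of the fusion are illustrative assumptions and are not taken from the HD-A2G paper; the sketch only shows the general structure the abstract describes.

```python
# Sketch only: a two-level hierarchical codebook with AdaIN fusion.
# Dimensions, module names, and the downsampling factor are assumptions
# for illustration, not the HD-A2G architecture itself.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through estimator."""

    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) continuous latents
        flat = z.reshape(-1, z.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)        # (B*T, num_codes)
        indices = dist.argmin(dim=-1).reshape(z.shape[:-1])   # (B, T) discrete codes
        z_q = self.codebook(indices)                           # quantized latents
        z_q = z + (z_q - z).detach()                           # straight-through gradient
        return z_q, indices


class AdaIN(nn.Module):
    """Adaptive instance normalization: modulate low-level features with a
    scale and shift predicted from the high-level features."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, low: torch.Tensor, high: torch.Tensor):
        scale, shift = self.to_scale_shift(high).chunk(2, dim=-1)
        low = (low - low.mean(dim=1, keepdim=True)) / (low.std(dim=1, keepdim=True) + 1e-5)
        return low * (1 + scale) + shift


class HierarchicalGestureVQVAE(nn.Module):
    """Two-level codebook: a coarse (high-level) quantizer over temporally
    downsampled latents and a fine (low-level) quantizer, fused via AdaIN
    before decoding back to poses."""

    def __init__(self, pose_dim: int = 141, dim: int = 256, num_codes: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.down = nn.AvgPool1d(kernel_size=4, stride=4, ceil_mode=True)
        self.vq_high = VectorQuantizer(num_codes, dim)
        self.vq_low = VectorQuantizer(num_codes, dim)
        self.fuse = AdaIN(dim)
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, pose_dim))

    def forward(self, poses: torch.Tensor):
        # poses: (batch, time, pose_dim)
        z = self.encoder(poses)                                   # (B, T, dim)
        z_high = self.down(z.transpose(1, 2)).transpose(1, 2)     # (B, ceil(T/4), dim)
        q_high, idx_high = self.vq_high(z_high)
        q_low, idx_low = self.vq_low(z)
        # Broadcast coarse codes back to the fine temporal rate and fuse with AdaIN.
        q_high_up = q_high.repeat_interleave(4, dim=1)[:, : z.size(1)]
        fused = self.fuse(q_low, q_high_up)
        return self.decoder(fused), (idx_high, idx_low)


# Dummy usage: reconstruct a 64-frame gesture clip and obtain both index streams.
model = HierarchicalGestureVQVAE()
poses = torch.randn(2, 64, 141)
recon, (idx_high, idx_low) = model(poses)
```

In the generation stage described above, a speech-conditioned model would then predict the two index sequences (idx_high and idx_low) rather than raw poses, with text and audio onset features routed to whichever codebook level the paper assigns them; that mapping is omitted from this sketch.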
