Abstract

Co-speech gesture synthesis is a challenging task due to the complexity and uncertainty between gestures and speech. Gestures that accompany speech (i.e., Co-Speech Gesture) are an essential part of natural and efficient embodied human communication, as they work in tandem with speech to convey information more effectively. Although data-driven approaches have improved gesture synthesis, existing deep learning-based methods use deterministic modeling which could lead to averaging out predicted gestures. Additionally, these methods lack control over gesture generation such as user editing of generated results. In this paper, we propose an editable gesture synthesis method based on a learned pose script, which disentangles gestures into individual representative and rhythmic gestures to produce high-quality, diverse and realistic poses. Specifically, we first detect the time of occurrence of gestures in video sequences and transform them into pose scripts. Regression models are then built to predict the pose scripts. Next, learned pose scripts are used for gesture synthesis, while rhythmic gestures are modeled using a variational auto-encoder and a one-dimensional convolutional network. Moreover, we introduce a large-scale Chinese co-speech gesture synthesis dataset with multimodal annotations for training and evaluation, which will be publicly available to facilitate future research. The proposed method allows for the re-editing of generated results by changing the pose scripts for applications such as interactive digital humans. The experimental results show that this method generates more quality, more diverse, and realistic gestures than other existing methods.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.