Abstract

People naturally accompany their speech with body motion and gestures. Generating gestures for digital humans or virtual avatars from speech audio or text remains challenging because the mapping from speech to gesture is non-deterministic. We observe that existing neural methods often produce gestures with too little movement, so that the results appear slow or dull. We therefore propose a novel generative model coupled with memory networks that act as dynamic dictionaries, in order to generate gestures with improved diversity. Within the proposed model, a dictionary network dynamically stores previously seen pose features together with their corresponding text features for the generator to look up, while a pose generation network takes audio and pose features as input and outputs the resulting gesture sequences. Seed poses are used in the generation process to ensure continuity between two speech segments. We also propose a new objective evaluation metric for the diversity of generated gestures and demonstrate that the proposed model generates gestures with improved diversity.
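To make the dictionary idea concrete, the following is a minimal sketch of a key-value memory that stores pose features under text-feature keys and returns the pose whose key is most similar to a query. It is only an illustration under stated assumptions: the class name `DynamicDictionary`, the fixed capacity with oldest-first eviction, and the cosine-similarity lookup are all hypothetical choices, not the architecture described in the paper, where the dictionary is a learned network.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DynamicDictionary:
    """Hypothetical memory: text-feature keys mapped to pose-feature values."""

    def __init__(self, capacity=4):
        # Assumption: a bounded memory that drops the oldest entry when full.
        self.slots = deque(maxlen=capacity)

    def write(self, text_feat, pose_feat):
        """Store a pose feature under the text feature it appeared with."""
        self.slots.append((list(text_feat), list(pose_feat)))

    def lookup(self, text_feat):
        """Return the stored pose whose key is most similar to the query."""
        if not self.slots:
            return None
        _, pose = max(self.slots, key=lambda kv: cosine(kv[0], text_feat))
        return pose


memory = DynamicDictionary(capacity=2)
memory.write([1.0, 0.0], [0.1, 0.2, 0.3])
memory.write([0.0, 1.0], [0.9, 0.8, 0.7])
retrieved = memory.lookup([0.9, 0.1])  # closest key is [1.0, 0.0]
```

In a generator of the kind the abstract describes, the retrieved pose feature would be passed to the pose generation network alongside the audio features; that step is omitted here because the abstract does not specify it.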
