Abstract

Realistic co-speech gestures are essential for anthropomorphizing embodied conversational agents (ECAs), since nonverbal behavior greatly improves the expressiveness of their speech. However, existing approaches that generate sufficiently detailed co-speech gestures (including fingers) in 3D scenarios are rare, and they seldom address abnormal gestures, temporal–spatial coherence, and the diversity of gesture sequences in a comprehensive way. To handle abnormal gestures, we propose an angle conversion method that removes body-part lengths from the original in-the-wild video dataset by transforming the coordinates of human upper-body key points into relative deflection angles and pitch angles. We also propose HARP, an encoder–decoder neural network built on CNN and LSTM layers that maps MFCC features of speech audio to these angles, which can then be rendered as the corresponding co-speech gestures. Compared with other recent approaches, the co-speech gestures generated by HARP are shown to be nearly as good as those of a real person, exhibiting strong temporal–spatial coherence, diversity, persuasiveness, and credibility. Our approach offers finer control over co-speech gestures than most existing works by handling all key points of the human upper body, and it is more practical for industrial application because HARP can adapt to any human upper-body model. All related code and demonstration videos of HARP are available at https://github.com/drrobincroft/HARP.
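
The angle conversion idea can be illustrated with a minimal sketch (not the authors' released code): each bone vector between a parent and child key point is reduced to a deflection (azimuth) and pitch (elevation) angle, discarding its length, and can later be re-rendered onto any target body model by supplying that model's own bone lengths. The joint indices and bone pairs below are hypothetical placeholders for whatever skeleton format the dataset actually uses.

```python
import numpy as np

# Hypothetical parent->child bone pairs for an upper-body skeleton;
# actual indices depend on the pose-estimation format of the dataset.
BONES = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5), (5, 6)]

def keypoints_to_angles(joints: np.ndarray) -> np.ndarray:
    """Convert 3D joint coordinates (J, 3) into per-bone deflection and
    pitch angles, removing bone length so the representation is
    independent of body-part proportions."""
    angles = []
    for parent, child in BONES:
        v = joints[child] - joints[parent]                        # bone vector
        deflection = np.arctan2(v[1], v[0])                       # angle in the x-y plane
        pitch = np.arctan2(v[2], np.linalg.norm(v[:2]) + 1e-8)    # elevation above x-y plane
        angles.append((deflection, pitch))
    return np.asarray(angles)                                     # shape (num_bones, 2)

def angles_to_keypoints(angles: np.ndarray, bone_lengths: np.ndarray,
                        root: np.ndarray) -> np.ndarray:
    """Re-render angles as 3D coordinates for a target body model by
    supplying that model's own bone lengths and a root joint position."""
    joints = {BONES[0][0]: root}
    for (parent, child), (deflection, pitch), length in zip(BONES, angles, bone_lengths):
        direction = np.array([
            np.cos(pitch) * np.cos(deflection),
            np.cos(pitch) * np.sin(deflection),
            np.sin(pitch),
        ])
        joints[child] = joints[parent] + length * direction
    return np.stack([joints[i] for i in sorted(joints)])
```

In this scheme, the network only has to predict angles from audio; skeleton-specific geometry is re-introduced at render time, which is what allows the same predicted motion to drive different upper-body models.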
