Abstract
Like a human listener, a listener agent reacts to its conversational partner's non-verbal behaviors such as head nods, facial expressions, and voice tone. When these modalities are adopted as inputs and a generative model of reactive and spontaneous behaviors is developed with machine learning techniques, issues of multimodal fusion emerge: the relative effectiveness of different modalities, the frame-wise interaction of multiple modalities, and the temporal feature extraction of individual modalities. This paper describes our investigation of these issues in the task of generating a virtual listener's reactive and spontaneous idling behaviors. The work compares corresponding recurrent neural network (RNN) configurations on their performance in generating the listener's (the agent's) head movements, gaze directions, facial expressions, and postures from the speaker's head movements, gaze directions, facial expressions, and voice tone. A data corpus recorded in a subject experiment on active listening is used as the ground truth. The results show that video information is more effective than audio information, and that frame-wise interaction of modalities is more effective than the temporal characteristics of individual modalities.
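To make the contrast between the compared configurations concrete, the following is a minimal sketch, not the authors' implementation, of two RNN fusion schemes of the kind the abstract describes: frame-wise (early) fusion of the speaker's modalities versus per-modality temporal encoders whose states are fused afterwards. All feature dimensions, module names, and hyperparameters are illustrative assumptions.

```python
# Sketch of two multimodal-fusion RNN configurations (assumed, not from the paper).
import torch
import torch.nn as nn

# Assumed per-frame feature sizes for the speaker's modalities (hypothetical).
SPEAKER_DIMS = {"head": 3, "gaze": 2, "face": 17, "voice": 13}
LISTENER_DIM = 3 + 2 + 17 + 4   # head + gaze + facial expression + posture (assumed)


class EarlyFusionRNN(nn.Module):
    """Concatenate all modalities at each frame, then model them with one GRU."""

    def __init__(self, hidden=128):
        super().__init__()
        in_dim = sum(SPEAKER_DIMS.values())
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, LISTENER_DIM)

    def forward(self, feats):  # feats: dict of (batch, frames, dim) tensors
        x = torch.cat([feats[k] for k in SPEAKER_DIMS], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)      # per-frame listener behavior features


class LateFusionRNN(nn.Module):
    """Encode each modality's temporal dynamics separately, then fuse the states."""

    def __init__(self, hidden=32):
        super().__init__()
        self.rnns = nn.ModuleDict(
            {k: nn.GRU(d, hidden, batch_first=True) for k, d in SPEAKER_DIMS.items()}
        )
        self.out = nn.Linear(hidden * len(SPEAKER_DIMS), LISTENER_DIM)

    def forward(self, feats):
        encoded = [self.rnns[k](feats[k])[0] for k in SPEAKER_DIMS]
        return self.out(torch.cat(encoded, dim=-1))


if __name__ == "__main__":
    batch, frames = 4, 100
    feats = {k: torch.randn(batch, frames, d) for k, d in SPEAKER_DIMS.items()}
    print(EarlyFusionRNN()(feats).shape)  # torch.Size([4, 100, 26])
    print(LateFusionRNN()(feats).shape)   # torch.Size([4, 100, 26])
```

Under these assumptions, comparing such configurations on a recorded corpus would correspond to testing whether frame-wise interaction of modalities or per-modality temporal modeling better predicts the listener's behaviors.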