Abstract
In this work, we established the foundations of a framework for building an end-to-end naturalistic expressive listening agent. The project was split into modules for the recognition of the user's paralinguistic and nonverbal expressions, the prediction of the agent's reactions, the synthesis of the agent's expressions, and the recording of nonverbal conversational expression data. First, a multimodal, multitask deep-learning-based emotion classification system was built, along with a rule-based visual expression detection system. Then, several sequence prediction systems for nonverbal expressions were implemented and compared. An audiovisual concatenation-based synthesis system was also implemented. Finally, a naturalistic, dyadic emotional conversation database was collected. We report here the work done on each of these modules and our planned future improvements.
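As a rough illustration of what such a multitask setup can look like, here is a minimal PyTorch sketch of a multimodal network with a shared fused representation and per-task heads. It is not the project's implementation: the feature dimensions, layer sizes, and the choice of tasks (arousal/valence regression plus laugh detection) are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a multimodal,
# multitask emotion classifier. Layer sizes, feature dimensions and
# task heads are illustrative assumptions, not taken from the report.
import torch
import torch.nn as nn

class MultimodalMultitaskNet(nn.Module):
    def __init__(self, audio_dim=88, video_dim=136, hidden=128):
        super().__init__()
        # One sequence encoder per modality
        # (e.g. acoustic features and facial landmarks).
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        # Fuse the two modality embeddings into a shared representation.
        self.fusion = nn.Linear(2 * hidden, hidden)
        # Task-specific heads sharing the fused trunk:
        self.arousal_head = nn.Linear(hidden, 1)  # regression
        self.valence_head = nn.Linear(hidden, 1)  # regression
        self.laugh_head = nn.Linear(hidden, 2)    # binary classification

    def forward(self, audio, video):
        _, ha = self.audio_enc(audio)  # h_n: (1, batch, hidden)
        _, hv = self.video_enc(video)
        fused = torch.tanh(self.fusion(torch.cat([ha[-1], hv[-1]], dim=-1)))
        return (self.arousal_head(fused),
                self.valence_head(fused),
                self.laugh_head(fused))

# Multitask training simply sums the per-task losses over the shared trunk.
model = MultimodalMultitaskNet()
audio = torch.randn(4, 100, 88)    # batch of 4 clips, 100 frames each
video = torch.randn(4, 100, 136)
arousal, valence, laugh_logits = model(audio, video)
loss = (nn.functional.mse_loss(arousal, torch.zeros(4, 1))
        + nn.functional.mse_loss(valence, torch.zeros(4, 1))
        + nn.functional.cross_entropy(laugh_logits, torch.zeros(4, dtype=torch.long)))
loss.backward()
```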
Highlights
This project is part of the eNTERFACE’17 Workshop. eNTERFACE is a multidisciplinary workshop focusing on multimodal interfaces.
Since we focus on nonverbal expressions, the subject of the discussion is not relevant: the goal of the agent is to react to nonverbal expressions with nonverbal expressions.
The expressions considered in this project are laughs and smiles and their intensity dimensions, head movements, and eyebrow movements, for they frequently occur in dyadic interactions.
Summary
This project is part of the eNTERFACE’17 Workshop. eNTERFACE is a multidisciplinary workshop focusing on multimodal interfaces; every year, researchers from around the world gather to work on different projects for a month. The goal of this project is to build a listening agent that reacts to a user using mainly nonverbal expressions. The recognition module detects and recognizes relevant expressions, from which the prediction system decides what the agent’s reaction should be; this reaction is then generated by the synthesis module. The expressions considered in this project are laughs and smiles and their intensity dimensions, head movements (nodding, shaking and tilting) and eyebrow movements (raising and frowning), for they frequently occur in dyadic interactions. These expressions are a part of all the previously mentioned modules. Since our agent should work in noisy environments and with “in the wild” data, we chose, for the recognition module, to work with the RECOLA and SEWA databases, which meet our requirements.
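The recognition-prediction-synthesis chain described above can be summarized with a minimal Python sketch. All function names, the Expression data structure, and the frame-level loop below are hypothetical, introduced only to make the data flow between the modules concrete; they are not the project's actual API.

```python
# Illustrative sketch of the module pipeline; names and stubs are assumptions.
from dataclasses import dataclass

@dataclass
class Expression:
    kind: str         # e.g. "laugh", "smile", "nod", "eyebrow_raise"
    intensity: float  # normalized to [0, 1]

def recognize(audio_frame, video_frame):
    """Recognition module: detect the user's nonverbal expressions.
    Stub: a real system would run the deep-learning and rule-based detectors."""
    return [Expression("smile", 0.5)]

def predict(history):
    """Prediction module: decide the agent's reaction from recent context.
    Stub: a real system would use a trained sequence prediction model."""
    last = history[-1] if history else Expression("neutral", 0.0)
    return Expression("nod", last.intensity)  # mirror the user's intensity

def synthesize(reaction):
    """Synthesis module: render the agent's reaction audiovisually.
    Stub: a real system would use concatenation-based audiovisual synthesis."""
    print(f"agent -> {reaction.kind} (intensity {reaction.intensity:.2f})")

def listening_loop(stream):
    """End-to-end loop: recognition feeds prediction, prediction feeds synthesis."""
    history = []
    for audio_frame, video_frame in stream:
        history.extend(recognize(audio_frame, video_frame))
        synthesize(predict(history))

# Tiny demo with dummy frames standing in for real audio/video input.
listening_loop([(None, None), (None, None)])
```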