Abstract
Emotion recognition is a strategy for social robots used to implement better Human-Robot Interaction and model their social behaviour. Since human emotions can be expressed in different ways (e.g., face, gesture, voice), multimodal approaches are useful to support the recognition process. However, although there exist studies dealing with multimodal emotion recognition for social robots, they still present limitations in the fusion process: their performance drops if one or more modalities are missing or if the modalities differ in quality. This is a common situation in social robotics, due to the high variety of the sensory capacities of robots; hence, more flexible multimodal models are needed. In this context, we propose an adaptive and flexible emotion recognition architecture able to work with multiple sources and modalities of information and to manage different levels of data quality and missing data, leading robots to better understand the mood of people in a given environment and adapt their behaviour accordingly. Each modality is analyzed independently, and the partial results are then aggregated with a previously proposed fusion method, EmbraceNet+, which we adapt and integrate into our framework. We also present an extensive review of state-of-the-art studies dealing with fusion methods for multimodal emotion recognition. We evaluate the performance of the proposed architecture through different tests in which several modalities are combined to classify emotions into four categories (i.e., happiness, neutral, sadness, and anger). Results reveal that our approach is able to adapt to the quality and presence of modalities. Furthermore, the results are validated and compared against similar proposals, showing competitive performance with state-of-the-art models.
Highlights
In human social interactions, emotion detection is a natural process that directly affects people's decision-making and actions during communication
Although there exist studies dealing with multimodal emotion recognition for social robots [7], [17], [18], they still present a limitation in the fusion process: their performance can drop if one or more modalities are missing or if the modalities differ in quality. This is a common situation in social robotics, since robots have a wide variety of sensory capacities and may capture the world through different sources and with different levels of quality; more flexible multimodal models are needed
We review two groups of late fusion methods: those based on Multi-Layer Perceptrons (MLP) [33]–[35] and those based on more complex models, such as combinations of Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, and others [36]–[39]
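To make the late-fusion idea concrete, the following is a minimal decision-level fusion sketch (not the paper's EmbraceNet+ method): each modality produces its own class-probability vector, and the fusion step takes a weighted average over whichever modalities are actually available, so a missing modality simply drops out. The modality names and example probabilities are illustrative assumptions.

```python
import numpy as np

# Hypothetical per-modality emotion probabilities over four classes:
# happiness, neutral, sadness, anger (values are illustrative only).
face_probs = np.array([0.70, 0.15, 0.10, 0.05])
voice_probs = np.array([0.40, 0.30, 0.20, 0.10])

def late_fusion(modality_probs, weights=None):
    """Weighted average of the available modality outputs.

    A missing modality is passed as None and is simply excluded,
    so the fused prediction degrades gracefully instead of failing.
    """
    if weights is None:
        weights = [1.0] * len(modality_probs)
    available = [(p, w) for p, w in zip(modality_probs, weights)
                 if p is not None]
    if not available:
        raise ValueError("no modality available")
    total_weight = sum(w for _, w in available)
    return sum(w * p for p, w in available) / total_weight

# Both modalities present: average their probability vectors.
fused = late_fusion([face_probs, voice_probs])
print(fused.argmax())  # index of the predicted emotion class

# Voice missing: the prediction falls back to the face modality alone.
only_face = late_fusion([face_probs, None])
```

MLP-based late fusion replaces the fixed weighted average with a small learned network over the concatenated modality outputs, which lets the model learn per-class trust in each modality rather than treating them uniformly.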
Summary
Emotion detection is a natural process that directly affects people's decision-making and actions during communication. Robots can detect the emotions of human beings through visual perception [1], speech [2], nonverbal communication [3], and mutual interaction [4], among other methods. In this sense, proposals for emotion detection in social robots have become more natural and faster in recent years, enabling a better understanding of how to communicate with people [5].