Abstract

Emotion recognition is a strategy that social robots use to improve Human-Robot Interaction and to model their social behaviour. Since human emotions can be expressed in different ways (e.g., face, gesture, voice), multimodal approaches are useful to support the recognition process. However, although there are studies on multimodal emotion recognition for social robots, their fusion processes remain limited: performance drops if one or more modalities are absent or if the modalities differ in quality. This is a common situation in social robotics, because robots vary widely in their sensory capacities; hence, more flexible multimodal models are needed. In this context, we propose an adaptive and flexible emotion recognition architecture that can work with multiple sources and modalities of information and can manage different levels of data quality and missing data, so that robots can better understand the mood of people in a given environment and adapt their behaviour accordingly. Each modality is analyzed independently, and the partial results are then aggregated with a previously proposed fusion method, EmbraceNet+, which is adapted and integrated into our proposed framework. We also present an extensive review of state-of-the-art studies on fusion methods for multimodal emotion recognition. We evaluate the performance of the proposed architecture through tests in which several modalities are combined to classify emotions into four categories (i.e., happiness, neutral, sadness, and anger). Results reveal that our approach is able to adapt to the quality and presence of modalities. Furthermore, the results are validated against similar proposals and show performance competitive with state-of-the-art models.

Highlights

  • In human social interactions, emotion detection is a natural process that directly affects people’s decision-making and actions during communication

  • Although there are studies on multimodal emotion recognition for social robots [7], [17], [18], they still present a limitation in the fusion process: their performance can drop if one or more modalities are absent or if the modalities differ in quality. This is a common situation in social robotics, since robots vary widely in their sensory capacities and may capture the world through different sources and at different levels of quality; more flexible multimodal models are needed

  • We review two groups of late fusion methods: those based on the Multilayer Perceptron (MLP) [33]–[35] and those based on more complex models, such as combinations of Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and others [36]–[39]
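
To make the idea of late fusion with missing or low-quality modalities concrete, the sketch below combines per-modality class probabilities over the four emotion categories used in the paper. It is a simple quality-weighted average written for illustration only; it is not EmbraceNet+ and not the authors' architecture. The function names (`fuse_late`, `predict_emotion`), the modality names, and the quality scores are all hypothetical assumptions.

```python
from typing import Dict, List, Optional

# The four emotion categories named in the abstract.
EMOTIONS = ["happiness", "neutral", "sadness", "anger"]


def fuse_late(
    predictions: Dict[str, Optional[List[float]]],
    quality: Dict[str, float],
) -> List[float]:
    """Quality-weighted average of per-modality class probabilities.

    A modality whose prediction is None is treated as missing and skipped.
    Quality scores are hypothetical weights (assumed >= 0); a modality
    without a score defaults to weight 1.0.
    """
    fused = [0.0] * len(EMOTIONS)
    total_weight = 0.0
    for name, probs in predictions.items():
        if probs is None:
            continue  # missing modality: contributes nothing
        weight = quality.get(name, 1.0)
        if weight <= 0.0:
            continue  # unusable modality: contributes nothing
        total_weight += weight
        for i, p in enumerate(probs):
            fused[i] += weight * p
    if total_weight == 0.0:
        # No usable modality: fall back to a uniform distribution.
        return [1.0 / len(EMOTIONS)] * len(EMOTIONS)
    return [value / total_weight for value in fused]


def predict_emotion(
    predictions: Dict[str, Optional[List[float]]],
    quality: Dict[str, float],
) -> str:
    """Return the label with the highest fused probability."""
    fused = fuse_late(predictions, quality)
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best]


# Usage: the voice modality is missing and the gesture stream is low quality.
example_predictions = {
    "face": [0.7, 0.1, 0.1, 0.1],
    "voice": None,
    "gesture": [0.1, 0.2, 0.6, 0.1],
}
example_quality = {"face": 0.9, "gesture": 0.3}
example_label = predict_emotion(example_predictions, example_quality)
```

In this example the missing voice input is skipped and the high-quality face prediction dominates the fused result. A learned fusion model, such as the one the paper adapts, would replace this fixed weighting rule.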


Summary

Introduction

Emotion detection is a natural process that directly affects people’s decision-making and actions during communication. Robots can detect human emotions through visual perception [1], speech [2], nonverbal communication [3], and mutual interaction [4], among other methods. In this sense, recent proposals for emotion detection in social robots have become faster and more natural, giving robots a better understanding of how to communicate with people [5].

