Abstract

Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we propose an automatic emotion recognition system that consists of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned the whole model with a multilayer perceptron appended on top of it, confirming that training is more robust when it does not start from scratch and the network's prior knowledge is related to the target task. For the FER, we extracted the Action Units of the videos and compared the performance of static models against sequential models. The sequential models outperformed the static ones by a narrow margin. Error analysis showed that the visual systems could improve with a detector of high-emotional-load frames, opening a new line of research into ways of learning from videos. Finally, combining the two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset under subject-wise 5-fold cross-validation, classifying eight emotions. The results demonstrate that both modalities carry relevant information about the user's emotional state and that combining them improves the final system's performance.
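The two transfer-learning setups compared in the abstract (embedding extraction vs. fine-tuning, with an MLP appended on top of the transformer) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pretrained xlsr-Wav2Vec2.0 encoder is replaced by a stub module so the sketch runs standalone, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class PretrainedEncoderStub(nn.Module):
    """Stand-in for a pretrained speech encoder (xlsr-Wav2Vec2.0 in the paper)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.conv = nn.Conv1d(1, hidden, kernel_size=400, stride=320)

    def forward(self, wav):  # wav: (batch, samples)
        # Returns frame-level features: (batch, frames, hidden)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)

class EmotionHead(nn.Module):
    """MLP appended on top of the encoder, after average pooling over time."""
    def __init__(self, hidden=768, n_classes=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(),
                                 nn.Linear(128, n_classes))

    def forward(self, feats):  # feats: (batch, frames, hidden)
        return self.mlp(feats.mean(dim=1))

encoder, head = PretrainedEncoderStub(), EmotionHead()

# (a) Embedding extraction: freeze the encoder, train only the MLP head.
for p in encoder.parameters():
    p.requires_grad = False

# (b) Fine-tuning: unfreeze everything and train end-to-end
#     (the variant that gave the best accuracy in the paper).
for p in encoder.parameters():
    p.requires_grad = True

logits = head(encoder(torch.randn(2, 16000)))  # two 1 s clips at 16 kHz
print(logits.shape)  # torch.Size([2, 8])
```

In practice the stub would be replaced by the real pretrained checkpoint (e.g. loaded through a speech-transformer library), with the same freeze/unfreeze logic applied to its parameters.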

Highlights

  • Emotions play a crucial role in our life decisions

  • Combining the aural and visual emotion recognizers achieved an overall accuracy of 86.70%, compared with 62.13% for the visual modality alone and 81.82% for the best speech-based strategy

  • This accuracy was obtained with a multinomial logistic regression over the combined and normalized posteriors of the fine-tuned xlsr-Wav2Vec2.0 model for speech emotion recognition (SER, 81.82% accuracy), the bidirectional long short-term memory (bi-LSTM) network with an attention mechanism for facial emotion recognition (FER, 62.13% accuracy), and a static multilayer perceptron (MLP) with 80 neurons for FER fed with the averaged and normalized Action Units (AUs)
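The late fusion described above can be sketched as follows: the class posteriors of the three models are concatenated into a single feature vector and a multinomial logistic regression is trained on top. This is a hedged sketch with synthetic stand-in posteriors, not the paper's code; the random Dirichlet vectors merely play the role of the SER and FER model outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_classes = 200, 8

# Synthetic stand-ins for the per-model class posteriors; in the paper these
# come from the SER (FT-xlsr-Wav2Vec2.0) and the two FER models.
post_ser = rng.dirichlet(np.ones(n_classes), n_samples)
post_fer_seq = rng.dirichlet(np.ones(n_classes), n_samples)     # bi-LSTM + attention
post_fer_static = rng.dirichlet(np.ones(n_classes), n_samples)  # static MLP on AUs
labels = rng.integers(0, n_classes, n_samples)

# Late fusion: concatenate the normalized posteriors and train a
# multinomial logistic regression on the fused vector.
fused = np.hstack([post_ser, post_fer_seq, post_fer_static])
clf = LogisticRegression(max_iter=1000)
clf.fit(fused, labels)

fused_probs = clf.predict_proba(fused)
print(fused_probs.shape)  # (200, 8)
```

A key property of this design is that each unimodal model can be trained and tuned independently; only the lightweight fusion classifier needs the paired multimodal data.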


Introduction

Emotions play a crucial role in our life decisions. Understanding them is of growing interest due to their potential applications, since knowing how others feel allows us to interact and transmit information more effectively. With the help of an emotion recognizer, other systems could detect loss of trust or changes in emotions by monitoring people's conduct. This capability will help specific systems such as Embodied Conversational Agents (ECAs) [1,2] to react to these events and adapt their decisions, improving conversations by adjusting their tone or facial expressions to create a better socio-affective user experience [3]. Automobile safety is another important application of facial expression recognition.
