Abstract

This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in either the unimodal or the multimodal interactions; (ii) pseudo-temporal architectures (PTA), which oversimplify the temporal dimension in only one of the unimodal or multimodal interactions; and (iii) temporal architectures (TA), which try to capture both unimodal and cross-modal temporal dependencies. In addition, we review the basic feature representation methods for each modality, and we present aggregated evaluation results for the reported methodologies. Finally, we conclude this work with an in-depth analysis of future challenges related to validation procedures, representation learning and method robustness.
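To make the distinction between the three categories more concrete, the following is a minimal, illustrative PyTorch sketch. It is not taken from any of the reviewed works; the module names, feature dimensions and the specific choices of GRUs and cross-modal attention are assumptions made purely for illustration. It contrasts a non-temporal model that mean-pools each modality before static fusion, a pseudo-temporal model that models each modality temporally but fuses only the final states, and a temporal model that couples the modalities across time steps.

```python
# Illustrative sketch only: toy dimensions and layer choices are assumptions,
# not a reference implementation of any reviewed method.
import torch
import torch.nn as nn

T, D_AUDIO, D_TEXT, D_HID, N_EMOTIONS = 50, 40, 300, 64, 4


class NTA(nn.Module):
    """Non-temporal: mean-pool each modality over time, then fuse statically."""
    def __init__(self):
        super().__init__()
        self.clf = nn.Linear(D_AUDIO + D_TEXT, N_EMOTIONS)

    def forward(self, audio, text):  # audio: (B, T, D_AUDIO), text: (B, T, D_TEXT)
        pooled = torch.cat([audio.mean(dim=1), text.mean(dim=1)], dim=-1)
        return self.clf(pooled)


class PTA(nn.Module):
    """Pseudo-temporal: temporal modeling inside each modality (GRUs),
    but cross-modal fusion is a simple concatenation of final states."""
    def __init__(self):
        super().__init__()
        self.gru_a = nn.GRU(D_AUDIO, D_HID, batch_first=True)
        self.gru_t = nn.GRU(D_TEXT, D_HID, batch_first=True)
        self.clf = nn.Linear(2 * D_HID, N_EMOTIONS)

    def forward(self, audio, text):
        _, h_a = self.gru_a(audio)
        _, h_t = self.gru_t(text)
        return self.clf(torch.cat([h_a[-1], h_t[-1]], dim=-1))


class TA(nn.Module):
    """Temporal: cross-modal attention across time steps, so inter-modality
    dynamics are also modeled along the temporal axis before pooling."""
    def __init__(self):
        super().__init__()
        self.proj_a = nn.Linear(D_AUDIO, D_HID)
        self.proj_t = nn.Linear(D_TEXT, D_HID)
        self.xattn = nn.MultiheadAttention(D_HID, num_heads=4, batch_first=True)
        self.clf = nn.Linear(D_HID, N_EMOTIONS)

    def forward(self, audio, text):
        a, t = self.proj_a(audio), self.proj_t(text)
        fused, _ = self.xattn(query=a, key=t, value=t)  # audio steps attend over text
        return self.clf(fused.mean(dim=1))


if __name__ == "__main__":
    audio = torch.randn(8, T, D_AUDIO)  # e.g. frame-level acoustic features
    text = torch.randn(8, T, D_TEXT)    # e.g. (aligned) word embeddings
    for model in (NTA(), PTA(), TA()):
        print(type(model).__name__, model(audio, text).shape)  # torch.Size([8, 4])
```

The TA variant above uses cross-modal attention only as one possible realization; the reviewed literature covers several others (e.g., cross-modal transformers or recurrent fusion). The sketch is meant solely to show at which level, unimodal or multimodal, each category models the temporal dimension.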

Highlights

  • The task of recognizing emotions in multimodal speech signals is vital and very challenging in the context of human–computer or human–human interaction applications

  • We propose three general categories, namely: (i) non-temporal architectures, i.e., approaches that simplify the modeling of the temporal dimension in both unimodal and multimodal interactions by assuming simple statistical representations; (ii) pseudo-temporal architectures, which oversimplify the temporal dimension in either the unimodal or the multimodal interactions; and (iii) temporal architectures, i.e., methods that try to capture both unimodal and cross-modal temporal dependencies

  • Some works present results on other datasets such as CMU-MOSEI (e.g., [63,66,84]) or RECOLA (e.g., [57]), but these works can only be grouped into small sets, and we believe that the best overview of the architectures in the field can be obtained from a comparison on IEMOCAP

Introduction

The task of recognizing emotions in multimodal speech signals is vital and very challenging in the context of human–computer or human–human interaction applications. In such applications, interaction is characterized not only by the content of the dialogues but also by how the involved parties feel when expressing their thoughts. Theories of language origin identify the combination of language and nonverbal behaviors (the visual and acoustic modalities) as the prime form of communication utilized by humans throughout evolution [2]. Why is it important for speech emotion recognition applications to adopt multimodality at the core of their architecture? In order for a multimodal model to function similarly to the way our brain perceives multimodality, it should satisfy both of the aforementioned requirements.
