Affective computing is one of the most important research fields in modern human–computer interaction (HCI). It aims to study and develop theories, methods, and systems that can recognize, interpret, process, and simulate human emotions. As a branch of affective computing, emotion recognition seeks to enable machines to analyze human emotions automatically and has received increasing attention from researchers in various fields. Humans generally observe and understand another person's emotional state by integrating perceived information from facial expressions, voice tone, speech content, behavior, and physiological signals. To imitate this way of observing emotions, researchers have devoted considerable effort to constructing multimodal emotion recognition models that fuse information from two or more modalities. In this paper, we provide a comprehensive review of multimodal emotion recognition over recent decades from the perspectives of multimodal datasets, data preprocessing, unimodal feature extraction, and multimodal information fusion methods. Furthermore, challenges and future research directions in this area are identified and discussed. The main motivations of this review are to summarize the recent abundance of work on multimodal emotion recognition and to provide researchers in related fields with guidance for understanding the pipeline and mainstream approaches to multimodal emotion recognition.