Sentiment analysis is a key component of many social media analysis projects, yet prior research has largely concentrated on a single modality, such as text descriptions of visual content. Unlike standard image databases, social images frequently link to one another, which makes sentiment analysis challenging; most existing methods treat images individually and are therefore ineffective for interrelated images. In this paper, we propose a hybrid Arithmetic Optimization Algorithm-Hunger Games Search (AOA-HGS)-optimized Ensemble Multi-scale Residual Attention Network (EMRA-Net) that exploits the correlations among modalities, including text, audio, social links, and video, for more effective multimodal sentiment analysis. The hybrid AOA-HGS technique learns complementary and comprehensive features. EMRA-Net comprises two segments, an Ensemble Attention CNN (EA-CNN) and a Three-scale Residual Attention Convolutional Neural Network (TRA-CNN), to analyze multimodal sentiment. Adding a wavelet transform to TRA-CNN reduces the loss of spatial-domain image texture features, while EA-CNN performs feature-level fusion of the visual, audio, and textual information. Evaluated on the Multimodal EmotionLines Dataset (MELD) and the EmoryNLP dataset, the proposed method significantly outperforms the existing multimodal sentiment analysis techniques HALCB, HDF, and MMLatch. Moreover, across varying training-set sizes, it surpasses the other techniques in recall, accuracy, F-score, and precision, and requires less computation time on both datasets.
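To illustrate the feature-level fusion idea attributed to EA-CNN above, the following is a minimal sketch, not the paper's actual implementation: each modality embedding is scored, the scores are normalized with a softmax to obtain attention weights, and the weighted features are concatenated into a single fused vector. The scoring rule (mean activation) and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def feature_level_fusion(visual, audio, text):
    """Hypothetical sketch of attention-weighted feature-level fusion:
    score each modality embedding, normalize the scores with softmax,
    and concatenate the attention-weighted features."""
    feats = [visual, audio, text]
    # toy per-modality score: mean activation (an assumption, not the paper's rule)
    scores = np.array([f.mean() for f in feats])
    weights = softmax(scores)
    fused = np.concatenate([w * f for w, f in zip(weights, feats)])
    return fused, weights

# usage: three 8-dimensional modality embeddings fused into one 24-dim vector
rng = np.random.default_rng(0)
v, a, t = (rng.standard_normal(8) for _ in range(3))
fused, weights = feature_level_fusion(v, a, t)
```

Feature-level (early) fusion, as used here, combines modality representations before classification, in contrast to decision-level fusion, which merges per-modality predictions.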