Audio-visual Emotion Recognition Research Articles

Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area have been dominated by the supervised learning paradigm. Despite significant progress, supervised learning is hitting a bottleneck because of the longstanding data scarcity in AVER. Motivated by recent advances in self-supervised learning, we propose the Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale pre-training on vast unlabeled audio-visual data to advance AVER. Following prior work in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training: masked data modeling and contrastive learning. Unlike prior methods, which focus exclusively on top-layer representations and neglect explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of the learned representations. First, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and to bolster masked audio-visual reconstruction. Second, hierarchical cross-modal contrastive learning is applied to intermediate representations to progressively narrow the audio-visual modality gap and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, indicating that HiCMAE is a powerful audio-visual emotion representation learner. Code and models are publicly available at https://github.com/sunlicai/HiCMAE.
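
To make the idea of cross-modal contrastive learning on intermediate representations more concrete, here is a minimal NumPy sketch. It is not the HiCMAE implementation (see the repository linked above for that): the symmetric InfoNCE loss, the temperature of 0.07, the plain averaging over layers, and the function names `info_nce` and `hierarchical_contrastive_loss` are all assumptions chosen for illustration. Each layer is assumed to yield one pooled embedding per clip for each modality.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def info_nce(audio, video, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D).

    Row i of `audio` and row i of `video` come from the same clip (positive
    pair); every other row in the batch serves as a negative.
    """
    a = l2_normalize(np.asarray(audio, dtype=np.float64))
    v = l2_normalize(np.asarray(video, dtype=np.float64))
    logits = a @ v.T / temperature           # (B, B) similarity matrix
    idx = np.arange(len(a))
    loss_a2v = -log_softmax(logits)[idx, idx].mean()    # audio -> video
    loss_v2a = -log_softmax(logits.T)[idx, idx].mean()  # video -> audio
    return 0.5 * (loss_a2v + loss_v2a)


def hierarchical_contrastive_loss(audio_layers, video_layers, temperature=0.07):
    """Average the contrastive loss over several encoder layers.

    `audio_layers` and `video_layers` are equal-length lists of (B, D) arrays,
    one per selected layer. Equal weighting of layers is an assumption.
    """
    losses = [info_nce(a, v, temperature)
              for a, v in zip(audio_layers, video_layers)]
    return float(np.mean(losses))
```

Applying the loss at several depths rather than only at the top layer is what makes the sketch "hierarchical": each selected layer receives its own alignment signal, so the two modalities are pulled together gradually instead of in a single step.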

Purpose: Although numerous signal modalities are available for emotion recognition, the audio and visual modalities are the most common and predominant ways for human beings to express their emotional states in daily communication. Achieving automatic and accurate audiovisual emotion recognition is therefore important for developing engaging and empathetic human–computer interaction environments. However, two major challenges exist in the field of audiovisual emotion recognition: (1) how to effectively capture representations of each single modality and eliminate redundant features and (2) how to efficiently integrate information from these two modalities to generate discriminative representations.

Design/methodology/approach: A novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN integrates key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, filling the research gaps of existing approaches. Specifically, a local maximum–based content analysis is designed to extract key-frames from videos in order to eliminate data redundancy. Two modules, the “Multi-head Attention-based Intra-modality Interaction Module” and the “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to capture intra- and cross-modality interactions, further reducing data redundancy and producing more powerful multimodal representations.

Findings: Extensive experiments on two benchmark datasets (i.e. RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN is superior to state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities can provide more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy can improve accuracy by more than 2.79 per cent. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion can lead to better prediction performance.

Originality/value: The proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environments.
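
As a rough illustration of what local maximum–based key-frame extraction can look like, the sketch below scores each frame by how much it differs from the previous one and keeps the frames whose score is a strict local maximum. The abstract does not specify the content measure used by KE-AFN, so the mean absolute pixel difference, the strict-peak rule, and the function names (`frame_change_scores`, `local_maxima`, `extract_key_frames`) are assumptions made only for this example.

```python
import numpy as np


def frame_change_scores(frames):
    """Score each frame by its mean absolute difference from the previous frame.

    `frames` has shape (T, H, W) or (T, H, W, C). The first frame has no
    predecessor, so its score is set to 0.
    """
    frames = np.asarray(frames, dtype=np.float64)
    spatial_axes = tuple(range(1, frames.ndim))
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=spatial_axes)
    return np.concatenate([[0.0], diffs])


def local_maxima(scores):
    """Return indices whose score is strictly greater than both neighbours."""
    s = np.asarray(scores)
    is_peak = (s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])
    return np.flatnonzero(is_peak) + 1  # +1 because the first frame was skipped


def extract_key_frames(frames):
    """Indices of frames at which the content change peaks locally."""
    return local_maxima(frame_change_scores(frames))
```

Frames between two peaks change little relative to their neighbours, which is why dropping them removes redundancy while keeping the moments where the visual content shifts most.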

Articles published on Audio-visual Emotion Recognition

Audiovisual emotion recognition based on bi-layer LSTM and multi-head attention mechanism on RAVDESS dataset

Deep operational audio-visual emotion recognition

HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition

Enhancing Emotion Recognition through Federated Learning: A Multimodal Approach with Convolutional Neural Networks

Kernel Probabilistic Dependent-Independent Canonical Correlation Analysis

Emotional dampening in hypertension: Impaired recognition of implicit emotional content in auditory and cross-modal stimuli.

A Neural Network Architecture for Children’s Audio–Visual Emotion Recognition

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Audio-Visual Emotion Recognition With Preference Learning Based on Intended and Multi-Modal Perceived Labels

Robust Audiovisual Emotion Recognition: Aligning Modalities, Capturing Temporal Information, and Handling Missing Features

End-to-End Modeling and Transfer Learning for Audiovisual Emotion Recognition in-the-Wild

Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN

An Enhanced CNN-2D for Audio-Visual Emotion Recognition (AVER) Using ADAM Optimizer

Leveraging recent advances in deep learning for audio-Visual emotion recognition

Information Fusion in Attention Networks Using Adaptive and Multi-Level Factorized Bilinear Pooling for Audio-Visual Emotion Recognition

Degraded visual and auditory input individually impair audiovisual emotion recognition from speech-like stimuli, but no evidence for an exacerbated effect from combined degradation

Fusion of deep learning features with mixture of brain emotional learning for audio-visual emotion recognition

Continuous Audiovisual Emotion Recognition Using Feature Selection and LSTM

Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Internet of emotional people: Towards continual affective computing cross cultures via audiovisual signals
