Abstract

Audio-visual emotion recognition is the task of identifying human emotional states by jointly using the audio and visual modalities, and it plays an important role in intelligent human-machine interactions. With the help of deep learning, previous works have made great progress in audio-visual emotion recognition. However, these deep learning methods often require a large amount of training data. In practice, data acquisition is difficult and expensive, especially for multimodal data covering several modalities. As a result, the training data may lie in the low-data regime, where deep models cannot be trained effectively. In addition, class imbalance may occur in the emotional data, which can further degrade the performance of audio-visual emotion recognition. To address these problems, we propose an efficient data augmentation framework by designing a multimodal conditional generative adversarial network (GAN) for audio-visual emotion recognition. Specifically, we design generators and discriminators for the audio and visual modalities. Category information is used as their shared input so that our GAN can generate fake data of different categories. In addition, the high dependence between the audio modality and the visual modality in the generated multimodal data is modeled with the Hirschfeld-Gebelein-Rényi (HGR) maximal correlation. In this way, we relate the different modalities of the generated data so that they approximate the real data. The generated data are then used to augment the data manifold, and we further apply our approach to deal with class imbalance. To the best of our knowledge, this is the first work to propose a data augmentation strategy with a multimodal conditional GAN for audio-visual emotion recognition. We conduct a series of experiments on three public multimodal datasets: eNTERFACE’05, RAVDESS, and CMEW. The results indicate that our multimodal conditional GAN is highly effective for data augmentation in audio-visual emotion recognition.
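
To make the shared-conditioning idea concrete, the sketch below pairs two class-conditional generators, one per modality, that receive the same noise vector and emotion label. This is a minimal PyTorch sketch under assumed settings (MLP generators over fixed-size feature vectors, hypothetical dimensions such as AUDIO_DIM and VISUAL_DIM), not the authors' actual architecture.

```python
# Minimal sketch of paired class-conditional generators for two modalities.
# Assumptions (not from the paper): PyTorch, MLP generators over fixed-size
# audio/visual feature vectors, and hypothetical dimensions.
import torch
import torch.nn as nn

NUM_CLASSES, NOISE_DIM = 6, 100      # e.g., 6 emotion categories (assumed)
AUDIO_DIM, VISUAL_DIM = 128, 512     # hypothetical feature sizes

class ConditionalGenerator(nn.Module):
    """Maps (noise, class label) to a fake feature vector for one modality."""
    def __init__(self, out_dim):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + NUM_CLASSES, 256),
            nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, z, labels):
        # The class label is the shared conditioning input for both modalities.
        return self.net(torch.cat([z, self.label_emb(labels)], dim=1))

audio_gen = ConditionalGenerator(AUDIO_DIM)
visual_gen = ConditionalGenerator(VISUAL_DIM)

z = torch.randn(16, NOISE_DIM)                  # shared noise vector
labels = torch.randint(0, NUM_CLASSES, (16,))   # shared emotion label
fake_audio, fake_visual = audio_gen(z, labels), visual_gen(z, labels)
```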

Highlights

  • Our proposed multimodal conditional generative adversarial network (GAN) is a generalization of existing GANs for data augmentation to improve the performance of audio–visual emotion recognition

  • Our approach achieves the highest performance compared with other methods, which shows that the data generated using our multimodal conditional GAN can significantly benefit audio–visual emotion recognition

Summary

Introduction

The task of emotion recognition is to detect human affective states. It is crucial for affect-related human–machine interaction and has attracted a lot of attention from researchers [1,2,3,4,5,6,7,8]. Our proposed multimodal conditional GAN is a generalization of existing GANs for data augmentation that improves the performance of audio–visual emotion recognition. Additional category information is used as the shared input of the generators and discriminators to generate fake data of different categories. It is shown in [17,47,48,49] that in real multimodal data the audio modality and the visual modality are highly dependent, which is beneficial to emotion recognition. We conduct experiments on three public multimodal datasets to show that our multimodal conditional GAN can be effectively used for data augmentation in audio–visual emotion recognition. To the best of our knowledge, this is the first work to propose an efficient data augmentation approach with a multimodal conditional GAN for audio–visual emotion recognition.
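
To illustrate the role of modality dependence, the sketch below computes a soft-HGR-style correlation between audio and visual feature mappings. This is one common neural approximation of HGR maximal correlation and is given only as an assumed illustration; the function name soft_hgr_correlation and the feature shapes are hypothetical and not necessarily the exact objective used in this work.

```python
# Sketch of a soft-HGR-style correlation objective between audio and visual
# features, one common neural approximation of HGR maximal correlation.
# Assumption: this illustrates the general idea, not the paper's exact loss.
import torch

def soft_hgr_correlation(f, g, eps=1e-6):
    """f, g: (batch, k) feature mappings of the audio and visual modalities."""
    f = f - f.mean(dim=0, keepdim=True)          # zero-mean constraint
    g = g - g.mean(dim=0, keepdim=True)
    n = f.size(0)
    inner = (f * g).sum(dim=1).mean()            # E[f(X)^T g(Y)]
    cov_f = f.t() @ f / (n - 1 + eps)            # feature covariances
    cov_g = g.t() @ g / (n - 1 + eps)
    # Higher value = stronger dependence between the two modalities.
    return inner - 0.5 * torch.trace(cov_f @ cov_g)

# Training would maximize this quantity (e.g., add its negative to the GAN
# loss) so that generated audio and visual features remain highly dependent.
```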

Multimodal Learning
Overview
Proposed Multimodal Conditional GAN
DNN Classifier
Datasets
Networks
Implementation Details
Experiment Results
Method
Conclusions
