Learning Better Representations for Audio-Visual Emotion Recognition with Common Information

Abstract

Audio-visual emotion recognition aims to distinguish human emotional states by integrating the audio and visual data acquired in the expression of emotions. It is crucial for affect-related human-machine interaction systems, enabling machines to respond intelligently to human emotions. One challenge of this problem is how to efficiently extract feature representations from audio and visual modalities. Although progress has been made by previous works, most of them ignore the common information between audio and visual data during the feature learning process, which may limit performance since these two modalities are highly correlated in terms of their emotional information. To address this issue, we propose a deep learning approach that efficiently utilizes common information for audio-visual emotion recognition through correlation analysis. Specifically, we design an audio network and a visual network to extract feature representations from audio and visual data respectively, and then employ a fusion network to combine the extracted features for emotion prediction. These neural networks are trained with a joint loss combining: (i) the correlation loss based on Hirschfeld-Gebelein-Rényi (HGR) maximal correlation, which extracts common information between the audio data, the visual data, and the corresponding emotion labels, and (ii) the classification loss, which extracts discriminative information from each modality for emotion prediction. We further generalize our architecture to the semi-supervised learning scenario. The experimental results on the eNTERFACE’05, BAUM-1s, and RAVDESS datasets show that common information can significantly enhance the stability of features learned from different modalities and improve emotion recognition performance.
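
As a rough illustration of the joint training objective described in the abstract, the following is a minimal PyTorch sketch that pairs a cross-entropy classification loss with a Soft-HGR surrogate for the HGR maximal correlation between audio and visual features. The toy network sizes, the `soft_hgr_loss` helper, and the trade-off weight `lam` are hypothetical placeholders, not the authors' implementation (which also correlates the features with the emotion labels).

```python
# Minimal sketch, assuming PyTorch and a Soft-HGR surrogate for HGR maximal correlation.
import torch
import torch.nn as nn

def soft_hgr_loss(f, g):
    """Negative Soft-HGR correlation between two zero-centred feature batches.

    f, g: (batch, dim) features of the two modalities. Maximising
    E[f(X)^T g(Y)] - 0.5 * tr(cov(f) @ cov(g)) encourages the networks
    to extract common (maximally correlated) information.
    """
    f = f - f.mean(dim=0, keepdim=True)
    g = g - g.mean(dim=0, keepdim=True)
    inner = (f * g).sum(dim=1).mean()                    # E[f(X)^T g(Y)]
    cov_f = f.t() @ f / (f.size(0) - 1)                  # covariance of f
    cov_g = g.t() @ g / (g.size(0) - 1)                  # covariance of g
    return -(inner - 0.5 * torch.trace(cov_f @ cov_g))   # negate: we minimise

class AVEmotionModel(nn.Module):
    """Audio network + visual network + fusion network (toy MLP stand-ins)."""
    def __init__(self, audio_dim=128, visual_dim=512, feat_dim=64, n_classes=6):
        super().__init__()
        self.audio_net = nn.Sequential(nn.Linear(audio_dim, feat_dim), nn.ReLU())
        self.visual_net = nn.Sequential(nn.Linear(visual_dim, feat_dim), nn.ReLU())
        self.fusion_net = nn.Linear(2 * feat_dim, n_classes)

    def forward(self, audio, visual):
        fa, fv = self.audio_net(audio), self.visual_net(visual)
        logits = self.fusion_net(torch.cat([fa, fv], dim=1))
        return fa, fv, logits

model = AVEmotionModel()
ce = nn.CrossEntropyLoss()
audio, visual = torch.randn(32, 128), torch.randn(32, 512)
labels = torch.randint(0, 6, (32,))
fa, fv, logits = model(audio, visual)
lam = 0.1  # hypothetical trade-off between classification and correlation terms
loss = ce(logits, labels) + lam * soft_hgr_loss(fa, fv)
loss.backward()
```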

Similar Papers
  • Conference Article
  • Citations: 9
  • 10.1109/ficloud.2019.00054
The Audio-Visual Arabic Dataset for Natural Emotions
  • Aug 1, 2019
  • Ftoon Abu Shaqra + 2 more

Emotions are a crucial aspect of human life, and researchers have tried to build automatic emotion recognition systems that support important real-world applications. Psychologists have shown that emotions differ across cultures; considering this fact, we provide and describe the first audio-visual Arabic emotional dataset, called AVANEmo. In this work we aim to fill the gap between studies of emotion recognition for Arabic content and other languages by providing an Arabic dataset, a fundamental prerequisite for building emotion recognition applications. Our dataset contains 3000 video and audio clips and covers six basic emotion labels (Happy, Sad, Angry, Surprise, Disgust, Neutral). We also provide baseline experiments to measure the initial performance of automated audio and visual emotion recognition using the AVANEmo dataset. The best accuracies achieved were 54.5% and 57.9% using the audio and visual data, respectively. The data will be available for distribution to researchers.

  • Research Article
  • 10.55041/ijsrem37976
Cross-Modal Harmony: Low-Rank Fusion for Enhanced Artificial Emotion Recognition
  • Oct 16, 2024
  • International Journal of Scientific Research in Engineering and Management
  • Himanshu Kumar + 1 more

Emotion recognition has become a subject of considerable interest in recent times, owing to its diverse and far-reaching applications in various fields. These applications span from enhancing human-computer interactions to assessing mental health and improving entertainment systems. The proposed study presents a novel approach for emotion recognition by fusing audio and video modalities using low-rank fusion techniques. The proposed methodology leverages the complementary nature of audio and video data in capturing emotional cues. Audio data often encapsulates tone, speech patterns, and vocal nuances, while video data captures facial expressions, body language, and gestures. However, the challenge lies in effectively integrating these two modalities to enhance recognition accuracy. To address the challenge, it employs low-rank fusion, a dimensionality reduction technique that extracts the most informative features from both modalities while minimizing redundancy. Furthermore, it presents the implementation of the chosen low-rank fusion algorithm in a real-world emotion recognition system. The results can contribute to advancing the field of emotion recognition by providing a practical and efficient solution for combining audio and video data to achieve more robust and accurate emotion classification. Keywords: Deep Learning; Emotion Recognition; Human-Computer Interaction; Low-Rank Fusion; Multimodal Fusion.
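
For readers unfamiliar with low-rank fusion, the sketch below shows one common way to factorize a bilinear audio-video interaction into low-rank per-modality factors, in the spirit of low-rank multimodal fusion. The `LowRankFusion` class, dimensions, and rank are illustrative assumptions, not the implementation used in this paper.

```python
# Minimal sketch of low-rank bilinear fusion of audio and video features (PyTorch).
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, audio_dim, video_dim, out_dim, rank=4):
        super().__init__()
        # One rank-"rank" factor per modality; the +1 keeps unimodal terms.
        self.audio_factor = nn.Parameter(torch.randn(rank, audio_dim + 1, out_dim) * 0.1)
        self.video_factor = nn.Parameter(torch.randn(rank, video_dim + 1, out_dim) * 0.1)

    def forward(self, a, v):
        ones = torch.ones(a.size(0), 1, device=a.device)
        a = torch.cat([a, ones], dim=1)                       # (B, audio_dim+1)
        v = torch.cat([v, ones], dim=1)                       # (B, video_dim+1)
        fused_a = torch.einsum('bd,rdo->rbo', a, self.audio_factor)
        fused_v = torch.einsum('bd,rdo->rbo', v, self.video_factor)
        return (fused_a * fused_v).sum(dim=0)                 # (B, out_dim): sum over rank

fusion = LowRankFusion(audio_dim=64, video_dim=128, out_dim=32)
z = fusion(torch.randn(8, 64), torch.randn(8, 128))           # fused representation
```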

  • Research Article
  • Citations: 128
  • 10.1016/j.inffus.2018.06.003
Audio-visual emotion fusion (AVEF): A deep efficient weighted approach
  • Jun 15, 2018
  • Information Fusion
  • Yaxiong Ma + 5 more


  • Research Article
  • Citations: 11
  • 10.1109/tmm.2013.2279659
Visual Speech Synthesis Using a Variable-Order Switching Shared Gaussian Process Dynamical Model
  • Dec 1, 2013
  • IEEE Transactions on Multimedia
  • Salil Deena + 2 more

In this paper, we present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Two talking head corpora are processed by extracting visual and audio data from the sequences followed by a parameterization of both data streams. Phonetic labels are obtained by performing forced phonetic alignment on the audio. The switching states are found using a variable length Markov model trained on the labelled phonetic data. The audio and visual data corresponding to phonemes matching each switching state are extracted and modelled together using a shared Gaussian process dynamical model. We propose a synthesis method that takes into account both previous and future phonetic context, thus accounting for forward and backward coarticulation in speech. Both objective and subjective evaluation results are presented. The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.

  • Research Article
  • Citations: 5
  • 10.1109/access.2023.3257565
Audio-to-Visual Cross-Modal Generation of Birds
  • Jan 1, 2023
  • IEEE Access
  • Joo Yong Shim + 2 more

Audio and visual modal data are essential elements of precise investigation in many fields. Sometimes it is difficult to obtain visual data while auditory data is easily available. In this case, generating visual data using audio data will be very helpful. This paper proposes a novel audio-to-visual cross-modal generation approach. The proposed sound encoder extracts the features of the auditory data and a generative model generates images using those audio features. This model is expected to learn (i) valid feature representation and (ii) associations between generated images and audio inputs to generate realistic and well-classified images. A new dataset is collected for this research called the Audio-Visual Corresponding Bird (AVC-B) dataset which contains the sounds and corresponding images of 10 different bird species. The experimental results show that the proposed method can generate class-appropriate images and achieve better classification results than the state-of-the-art methods.
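
A minimal sketch of the general audio-to-visual generation pattern the abstract describes, a sound encoder whose features condition an image generator, is given below. The layer choices, names (`SoundEncoder`, `ImageGenerator`), and dimensions are toy assumptions, not the proposed architecture.

```python
# Minimal sketch of an audio-to-image pipeline (PyTorch), under the assumptions above.
import torch
import torch.nn as nn

class SoundEncoder(nn.Module):
    def __init__(self, n_mels=64, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Conv1d(n_mels, 64, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                 nn.Linear(64, feat_dim))

    def forward(self, mel):                 # mel: (B, n_mels, frames)
        return self.net(mel)                # (B, feat_dim) audio feature

class ImageGenerator(nn.Module):
    def __init__(self, feat_dim=128, img_size=32):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(nn.Linear(feat_dim, 3 * img_size * img_size), nn.Tanh())

    def forward(self, feat):                # generate an image from the audio feature
        return self.net(feat).view(-1, 3, self.img_size, self.img_size)

enc, gen = SoundEncoder(), ImageGenerator()
fake_images = gen(enc(torch.randn(4, 64, 100)))   # (4, 3, 32, 32)
```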

  • Conference Article
  • Citations: 103
  • 10.1109/cvprw56347.2022.00511
M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation
  • Jun 1, 2022
  • Vishal Chudasama + 5 more

Emotion Recognition in Conversations (ERC) is crucial for developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modalities. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.
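
As a hedged illustration of an adaptive-margin triplet loss, the sketch below scales the margin with anchor-negative similarity. This is only one plausible reading of "adaptive margin"; the function name and constants are hypothetical and this is not M2FNet's actual loss.

```python
# Minimal sketch of an adaptive-margin triplet loss (PyTorch), under the assumptions above.
import torch
import torch.nn.functional as F

def adaptive_margin_triplet(anchor, positive, negative, base_margin=0.2, scale=0.3):
    d_ap = F.pairwise_distance(anchor, positive)        # anchor-positive distance
    d_an = F.pairwise_distance(anchor, negative)        # anchor-negative distance
    # The harder the negative (more similar to the anchor), the larger the margin.
    sim_an = F.cosine_similarity(anchor, negative)
    margin = base_margin + scale * sim_an.clamp(min=0)
    return F.relu(d_ap - d_an + margin).mean()

loss = adaptive_margin_triplet(torch.randn(16, 64), torch.randn(16, 64), torch.randn(16, 64))
```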

  • Conference Article
  • Citations: 7
  • 10.1145/2567688.2567690
Emotion Recognition from Audio and Visual Data using F-score based Fusion
  • Mar 21, 2014
  • Abhishek Gera + 1 more

Emotion recognition has been one of the cornerstones of human-computer interaction. Although decades of work has attacked the problem of automatic emotion recognition from either audio or video signals, the fusion of the two modalities is more recent. In this paper, we aim to tackle the problem when both audio and video data are available in a synchronized manner. We address the six basic human emotions, namely, anger, disgust, fear, happiness, sadness, and surprise. We employ an automatic face tracker to extract the different facial points of interest from a video. We then compute feature vectors for each video frame using distances and angles between the tracked points. For audio data, we use the pitch, energy and MFCC to derive feature vectors for each window as well as the entire audio signal. We use two standard techniques, GMM-based HMM and SVM, as the base classifiers. We then design a novel fusion method using the F-score of the base classifiers. We first demonstrate that our fusion approach can increase the accuracy of the base classifiers by as much as 5%. Finally, we show that our fusion-based bi-modal emotion recognition method achieves an overall accuracy of 54% on a publicly available database, which is an improvement upon the current state-of-the-art by 9%.
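
A minimal sketch of F-score-weighted decision fusion is given below, assuming each base classifier outputs per-class posteriors and that its per-class F-scores are measured on held-out data. The function and array names are illustrative, not the paper's exact scheme.

```python
# Minimal sketch of F-score-weighted fusion of two base classifiers (numpy).
import numpy as np

def fscore_fusion(p_audio, p_video, f_audio, f_video):
    """p_*: (n_samples, n_classes) posteriors; f_*: (n_classes,) validation F-scores."""
    fused = p_audio * f_audio + p_video * f_video   # weight each class column by its F-score
    return fused.argmax(axis=1)                     # fused class predictions

p_a = np.array([[0.6, 0.4], [0.3, 0.7]])            # audio classifier posteriors
p_v = np.array([[0.2, 0.8], [0.5, 0.5]])            # video classifier posteriors
pred = fscore_fusion(p_a, p_v, np.array([0.5, 0.7]), np.array([0.8, 0.6]))
```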

  • Research Article
  • 10.52783/cana.v32.4274
A Unified Framework for Multimodal Emotion Recognition: Leveraging Text, Audio, and Visual Data for Enhanced Emotional Understanding
  • Mar 12, 2025
  • Communications on Applied Nonlinear Analysis
  • Sanjeeva Rao Sanku, B.Sandhya

Emotion recognition based on multimodal data (e.g., video, audio, text, etc.) is a highly demanding and significant research field with numerous applications. This research rigorously explores model-level fusion to find the best multifunctional model combining audio and visual modalities for emotion identification. Specifically, it proposes novel feature extractor networks for both audio and video data. This research presents a comprehensive approach to multimodal emotion recognition, utilizing state-of-the-art feature extraction methods tailored to each modality. For text data, we implement the Assimilated N-gram Approach (ANA) to effectively capture contextual information. Audio features are extracted using Mel-Frequency Cepstral Coefficients (MFCC), ideal for capturing spectral characteristics in speech. Visual features are derived using SqueezeNet, a deep learning architecture optimized for efficient and informative visual data representation. To integrate the extracted features from the text, audio, and visual modalities, we propose a multimodal data fusion strategy that combines information across modalities, thereby enhancing the overall representation of emotional cues. In the classification stage, we employ Capsule Net, a novel neural network architecture adept at capturing hierarchical relationships and spatial hierarchies within data, making it well suited for handling complex multimodal data. To further optimize the performance of the Capsule Net classifier, we utilize hyperparameter tuning through the Sand Cat Swarm Optimization (SCSO) algorithm. SCSO, a metaheuristic optimization technique inspired by the behavior of sand cats, iteratively updates candidate solutions to converge towards optimal hyperparameter configurations. Using the Multimodal Emotion Lines Dataset (MELD), our approach achieved an accuracy of 98.91%, precision of 98.83%, recall of 99.04%, and F-measure of 98.94%. These results highlight the effectiveness of our multimodal framework in emotion recognition tasks.
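
To make the audio branch concrete, here is a minimal sketch of MFCC extraction with librosa (13 coefficients, mean and standard deviation pooled over frames). The helper name, the pooling choice, and the file path are assumptions; the text, visual, Capsule Net, and SCSO components are not reproduced.

```python
# Minimal sketch of clip-level MFCC features with librosa, under the assumptions above.
import librosa
import numpy as np

def mfcc_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)                       # load the waveform
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, frames)
    # Pool over frames to get one fixed-length vector per clip.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# features = mfcc_features("clip.wav")   # hypothetical audio file
```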

  • Book Chapter
  • Citations: 12
  • 10.1108/s1548-6435(2013)0000010012
An Introduction to Audio and Visual Research and Applications in Marketing
  • Jan 1, 2013
  • Li Xiao + 2 more

Purpose – The advancement of multimedia technology has spurred the use of multimedia in business practice. The adoption of audio and visual data will accelerate as marketing scholars become more aware of the value of audio and visual data and the technologies required to reveal insights into marketing problems. This chapter aims to introduce marketing scholars to this field of research. Design/methodology/approach – This chapter reviews the current technology in audio and visual data analysis and discusses rewarding research opportunities in marketing using these data. Findings – Compared with traditional data such as survey and scanner data, audio and visual data provide richer information and are easier to collect. Given this superiority, data availability, feasibility of storage, and increasing computational power, we believe that these data will contribute to better marketing practices with the help of marketing scholars in the near future. Practical implications – The adoption of audio and visual data in marketing practice will help practitioners gain better insights into marketing problems and thus make better decisions. Value/originality – This chapter makes the first attempt in the marketing literature to review the current technology in audio and visual data analysis and proposes promising applications of such technology. We hope it will inspire scholars to utilize audio and visual data in marketing research.

  • Conference Article
  • Citations: 12
  • 10.1145/3274783.3275200
Multimodal Emotion Recognition by extracting common and modality-specific information
  • Nov 4, 2018
  • Wei Zhang + 5 more

Emotion recognition technologies have been widely used in numerous areas including advertising, healthcare, and online education. Previous works usually recognize emotion from either the acoustic or the visual signal alone, yielding unsatisfactory performance and limited applications. To improve the inference capability, we present a multimodal emotion recognition model, EMOdal. Apart from learning from the audio and visual data respectively, EMOdal efficiently learns the common and modality-specific information underlying the two kinds of signals, and therefore improves the inference ability. The model has been evaluated on our large-scale emotional dataset. The comprehensive evaluations demonstrate that our model outperforms traditional approaches.
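
A generic sketch of the shared/private (common vs. modality-specific) encoder pattern is given below, assuming PyTorch, simple linear encoders, a similarity term that pulls the shared parts together, and an orthogonality-style penalty between shared and private parts. This is a common pattern in the literature, not EMOdal's actual design.

```python
# Minimal sketch of shared vs. modality-specific encoders, under the assumptions above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrivateEncoders(nn.Module):
    def __init__(self, dim_in, dim_out=32):
        super().__init__()
        self.shared = nn.Linear(dim_in, dim_out)    # captures common information
        self.private = nn.Linear(dim_in, dim_out)   # captures modality-specific information

    def forward(self, x):
        return self.shared(x), self.private(x)

audio_enc, video_enc = SharedPrivateEncoders(128), SharedPrivateEncoders(512)
sa, pa = audio_enc(torch.randn(16, 128))
sv, pv = video_enc(torch.randn(16, 512))
# Pull the shared parts together; keep shared and private parts dissimilar.
common_loss = F.mse_loss(sa, sv)
ortho_loss = (F.cosine_similarity(sa, pa).abs().mean()
              + F.cosine_similarity(sv, pv).abs().mean())
```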

  • Research Article
  • Citations: 52
  • 10.3390/app12010527
Data Augmentation for Audio-Visual Emotion Recognition with an Efficient Multimodal Conditional GAN
  • Jan 5, 2022
  • Applied Sciences
  • Fei Ma + 4 more

Audio-visual emotion recognition is the task of identifying human emotional states by combining the audio modality and the visual modality simultaneously, which plays an important role in intelligent human-machine interactions. With the help of deep learning, previous works have made great progress for audio-visual emotion recognition. However, these deep learning methods often require a large amount of data for training. In reality, data acquisition is difficult and expensive, especially for the multimodal data with different modalities. As a result, the training data may be in the low-data regime, which cannot be effectively used for deep learning. In addition, class imbalance may occur in the emotional data, which can further degrade the performance of audio-visual emotion recognition. To address these problems, we propose an efficient data augmentation framework by designing a multimodal conditional generative adversarial network (GAN) for audio-visual emotion recognition. Specifically, we design generators and discriminators for audio and visual modalities. The category information is used as their shared input to make sure our GAN can generate fake data of different categories. In addition, the high dependence between the audio modality and the visual modality in the generated multimodal data is modeled based on Hirschfeld-Gebelein-Rényi (HGR) maximal correlation. In this way, we relate different modalities in the generated data to approximate the real data. Then, the generated data are used to augment our data manifold. We further apply our approach to deal with the problem of class imbalance. To the best of our knowledge, this is the first work to propose a data augmentation strategy with a multimodal conditional GAN for audio-visual emotion recognition. We conduct a series of experiments on three public multimodal datasets, including eNTERFACE’05, RAVDESS, and CMEW. The results indicate that our multimodal conditional GAN has high effectiveness for data augmentation of audio-visual emotion recognition.
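
As a hedged sketch of the shared conditioning idea, the code below builds a pair of class-conditional generators that take the same label input, so the generated audio and visual samples belong to the same emotion category. The architecture, names, and dimensions are toy assumptions; the discriminators and the HGR correlation term are omitted.

```python
# Minimal sketch of paired class-conditional generators (PyTorch), under the assumptions above.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, z_dim, n_classes, out_dim):
        super().__init__()
        self.embed = nn.Embedding(n_classes, z_dim)       # shared-style label conditioning
        self.net = nn.Sequential(nn.Linear(2 * z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

g_audio = ConditionalGenerator(z_dim=64, n_classes=6, out_dim=128)
g_visual = ConditionalGenerator(z_dim=64, n_classes=6, out_dim=512)
y = torch.randint(0, 6, (8,))                             # one emotion label per sample
fake_audio = g_audio(torch.randn(8, 64), y)               # same labels feed both generators
fake_visual = g_visual(torch.randn(8, 64), y)
```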

  • Research Article
  • Citations: 9
  • 10.3390/math13071100
Hybrid Multi-Attention Network for Audio–Visual Emotion Recognition Through Multimodal Feature Fusion
  • Mar 27, 2025
  • Mathematics
  • Sathishkumar Moorthy + 1 more

Multimodal emotion recognition involves leveraging complementary relationships across modalities to enhance the assessment of human emotions. Networks that integrate diverse information sources outperform single-modal approaches while offering greater robustness against noisy or missing data. Current emotion recognition approaches often rely on cross-modal attention mechanisms, particularly between the audio and visual modalities, and these methods typically assume the complementary nature of the data. Despite this assumption, non-complementary relationships commonly arise in real-world data, reducing the effectiveness of feature integration that assumes consistent complementarity. While audio–visual co-learning provides a broader understanding of contextual information for practical implementation, discrepancies between audio and visual data, such as semantic inconsistencies, pose challenges and can lead to inaccurate predictions. As a result, such methods have limitations in modeling the intramodal and cross-modal interactions. In order to address these problems, we propose a multimodal learning framework for emotion recognition, called the Hybrid Multi-ATtention Network (HMATN). Specifically, we introduce a collaborative cross-attentional paradigm for audio–visual amalgamation, intending to effectively capture salient features across modalities while preserving both intermodal and intramodal relationships. The model calculates cross-attention weights by analyzing the relationship between the combined feature representations and the distinct modalities. Meanwhile, the network employs the Hybrid Attention of Single and Parallel Cross-Modal (HASPCM) mechanism, comprising a single-modal attention component and a parallel cross-modal attention component, to harness complementary multimodal data and hidden features and improve representation. Additionally, these modules exploit complementary and concealed multimodal information to enhance the richness of feature representation. Finally, the efficiency of the proposed method is demonstrated through experiments on complex videos from the AffWild2 and AFEW-VA datasets. The findings of these tests show that the developed attentional audio–visual fusion model offers a cost-efficient solution that surpasses state-of-the-art techniques, even when the input data are noisy or missing modalities.
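
A minimal sketch of cross-modal attention between audio and visual feature sequences, using PyTorch's nn.MultiheadAttention with residual connections, is shown below. It illustrates the general mechanism only; it is not the HMATN or HASPCM module, and the dimensions are toy assumptions.

```python
# Minimal sketch of bidirectional cross-modal attention (PyTorch), under the assumptions above.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_seq, visual_seq):
        a_att, _ = self.a2v(audio_seq, visual_seq, visual_seq)   # audio attends to visual
        v_att, _ = self.v2a(visual_seq, audio_seq, audio_seq)    # visual attends to audio
        return a_att + audio_seq, v_att + visual_seq             # residual connections

xa, xv = torch.randn(2, 10, 64), torch.randn(2, 20, 64)          # (batch, frames, dim)
fa, fv = CrossModalAttention()(xa, xv)
```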

  • Conference Article
  • Citations: 4
  • 10.1109/upcon52273.2021.9667588
A review on EEG based Emotion Analysis using Machine Learning approaches
  • Nov 11, 2021
  • Tanya Sharma + 5 more

The rapid advancement of machine learning algorithms combined with information fusion has made computers and machines able to understand human emotions, so that they can recognize and analyze them. There are several ways to recognize human emotions: from facial expressions, behavior, speech, and physiological signals. This paper presents a review of emotion categorization, the use of EEG signals for emotion recognition, and the use of ML techniques such as SVM, CNN, NB, and LR and deep learning models such as CNN and RNN for performing emotion recognition on different types of data, such as visual data, audio data, and text data. This paper also discusses the classical and modern methods used for processing EEG signals for emotion recognition.

  • Research Article
  • Citations: 11
  • 10.1093/sleep/33.3.281
Sleep Deprivation and Emotion Recognition
  • Mar 1, 2010
  • Sleep
  • Carmen M Schroder


  • Conference Instance
  • Citations: 1
  • 10.1145/2808196
Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge
  • Oct 26, 2015
  • Michel Valstar + 4 more

It is our great pleasure to welcome you to the 5th Audio-Visual Emotion recognition Challenge (AVEC 2015), held in conjunction with ACM Multimedia 2015. This year's challenge and associated workshop continues to push the boundaries of audio-visual emotion recognition. The first AVEC challenge posed the problem of detecting discrete emotion classes on an extremely large set of natural behaviour data. The second AVEC extended this problem to the prediction of continuous valued dimensional affect on the same set of challenging data. In its third edition, we enlarged the problem even further to include the prediction of self-reported severity of depression. The fourth edition of AVEC focused on the study of depression and affect by narrowing down the number of tasks to be used, and enriching the annotation. Finally, this year we've focused the study of affect by including physiology, along with audio-visual data, in the dataset, making it the very first emotion recognition challenge that bridges across audio, video and physiological data. The mission of the AVEC challenge and workshop series is to provide a common benchmark test set for individual multimodal information processing and to bring together the audio, video and -- for the first time ever -- physiological emotion recognition communities, to compare the relative merits of the three approaches to emotion recognition under well-defined and strictly comparable conditions, and to establish to what extent fusion of the approaches is possible and beneficial. A second motivation is the need to advance emotion recognition systems to be able to deal with naturalistic behaviour in large volumes of un-segmented, non-prototypical and non-preselected data. As you will see, these goals have been reached with the selection of this year's data and the challenge contributions. The call for participation attracted 15 submissions from Asia, Europe, Oceania and North America. The programme committee accepted 9 papers in addition to the baseline paper for oral presentation. For the challenge, no less than 48 results submissions were made by 13 teams! We hope that these proceedings will serve as a valuable reference for researchers and developers in the area of audio-visual-physiological emotion recognition and analysis. We also encourage attendees to attend the keynote presentation. This valuable and insightful talk can and will guide us to a better understanding of the state of the field and future directions: AVEC'15 Keynote Talk -- From Facial Expression Analysis to Multimodal Mood Analysis, Pr. Roland Goecke (University of Canberra, Australia)
