EAVS-TF: A bio-inspired spiking neural network for energy-efficient multimodal emotion recognition

Similar Papers
  • Research Article
  • Citations: 53
  • 10.1109/tnnls.2024.3367940
DER-GCN: Dialog and Event Relation-Aware Graph Convolutional Neural Network for Multimodal Dialog Emotion Recognition.
  • Mar 1, 2025
  • IEEE transactions on neural networks and learning systems
  • Wei Ai + 3 more

With the continuous development of deep learning (DL), multimodal dialog emotion recognition (MDER), an essential branch of DL, has recently received extensive research attention. MDER aims to identify the emotional information contained in different modalities, e.g., text, video, and audio, and in different dialog scenes. However, existing research has focused on modeling contextual semantic information and dialog relations between speakers while ignoring the impact of event relations on emotion. To tackle this issue, we propose a novel dialog and event relation-aware graph convolutional neural network (DER-GCN) for multimodal emotion recognition. It models dialog relations between speakers and captures latent event relation information. Specifically, we construct a weighted multirelationship graph to simultaneously capture the dependencies between speakers and event relations in a dialog. Moreover, we introduce a self-supervised masked graph autoencoder (SMGAE) to improve the fusion representation ability of features and structures. Next, we design a new multiple information Transformer (MIT) to capture the correlation between different relations, which better fuses the multivariate information between relations. Finally, we propose a loss optimization strategy based on contrastive learning to enhance the representation learning ability of minority class features. We conduct extensive experiments on the benchmark datasets Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Multimodal EmotionLines Dataset (MELD), which verify the effectiveness of the DER-GCN model. The results demonstrate that our model significantly improves both the average accuracy and the F1 value of emotion recognition. Our code is publicly available at https://github.com/yuntaoshou/DER-GCN.

  • Research Article
  • 10.1016/j.eswa.2026.132002
AMB-DSGDN: Adaptive modality-balanced dynamic semantic graph differential network for multimodal emotion recognition
  • Jul 1, 2026
  • Expert Systems with Applications
  • Yunsheng Wang + 5 more

• We propose AMB-DSGDN to capture dynamic emotions via differential subgraph attention.
• We design differential attention to denoise and highlight modality-specific cues.
• We propose adaptive dropout to balance modalities by reducing dominance and rescaling.
• Experiments on IEMOCAP and MELD show AMB-DSGDN’s superior robust performance.

Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal representations. On the one hand, they are unable to effectively filter out redundant or noisy signals within multimodal features, which hinders the accurate capture of the dynamic evolution of emotional states across and within speakers. On the other hand, during multimodal feature learning, dominant modalities (e.g., textual cues) tend to overwhelm the fusion process, thereby suppressing the complementary contributions of non-dominant modalities such as speech and vision, ultimately constraining the overall recognition performance. To address these challenges, we propose an Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network (AMB-DSGDN). Concretely, we first construct modality-specific subgraphs for text, speech, and vision, where each modality contains intra-speaker and inter-speaker graphs to capture both self-continuity and cross-speaker emotional dependencies. On top of these subgraphs, we introduce a differential graph attention mechanism, which computes the discrepancy between two sets of attention maps. By explicitly contrasting these attention distributions, the mechanism cancels out shared noise patterns while retaining modality-specific and context-relevant signals, thereby yielding purer and more discriminative emotional representations. In addition, we design an adaptive modality balancing mechanism, which estimates a dropout probability for each modality according to its relative contribution in emotion modeling. This mechanism randomly discards a portion of features from dominant modalities to suppress their overwhelming influence, while proportionally rescaling the preserved features based on the dropout probability to maintain overall information balance. Extensive experiments on IEMOCAP and MELD datasets validate that AMB-DSGDN significantly outperforms state-of-the-art baselines, demonstrating its effectiveness and robustness in multimodal conversational emotion recognition.
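
The adaptive modality balancing idea in the abstract above (a per-modality dropout probability, with surviving features rescaled) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear mapping from contribution scores to probabilities, and the `max_p` cap are all assumptions made for illustration, and the rescaling uses standard inverted dropout (survivors multiplied by 1/(1-p)).

```python
import random


def dropout_probs(contributions, max_p=0.5):
    """Map per-modality contribution scores to dropout probabilities.

    Hypothetical rule (not from the paper): the least-contributing modality
    gets p = 0 and the most-contributing one gets p = max_p, linearly between.
    """
    lo, hi = min(contributions.values()), max(contributions.values())
    span = hi - lo
    if span == 0:
        # All modalities contribute equally: drop nothing.
        return {m: 0.0 for m in contributions}
    return {m: max_p * (c - lo) / span for m, c in contributions.items()}


def modality_dropout(features, p, rng):
    """Inverted dropout: zero each feature with probability p and rescale
    the survivors by 1 / (1 - p) so the expected value is unchanged."""
    if p <= 0.0:
        return list(features)
    scale = 1.0 / (1.0 - p)
    return [x * scale if rng.random() >= p else 0.0 for x in features]


if __name__ == "__main__":
    # Illustrative contribution scores; text is assumed to dominate.
    probs = dropout_probs({"text": 0.60, "audio": 0.25, "vision": 0.15})
    rng = random.Random(0)
    balanced_text = modality_dropout([1.0] * 8, probs["text"], rng)
    print(probs, balanced_text)
```

With these assumed scores, the dominant text modality is dropped most aggressively while the weakest modality is left untouched; in a real model the contribution scores would be estimated during training rather than fixed by hand.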

  • Conference Article
  • Citations: 1
  • 10.18653/v1/2025.acl-long.102
ECERC: Evidence-Cause Attention Network for Multi-Modal Emotion Recognition in Conversation
  • Jan 1, 2025
  • Tao Zhang + 1 more

Multi-modal Emotion Recognition in Conversation (MMERC) aims to identify speakers' emotional states using multi-modal conversational data, which is significant for various domains. MMERC requires addressing emotional causes: contextual factors that influence emotions, alongside emotional evidence directly expressed in the target utterance. Existing methods primarily model general conversational dependencies, such as sequential utterance relationships or inter-speaker dynamics, but fall short in capturing diverse and detailed emotional causes, including emotional contagion, influences from others, and self-referenced or externally introduced events. To address these limitations, we propose the Evidence-Cause Attention Network for Multi-Modal Emotion Recognition in Conversation (ECERC). ECERC integrates emotional evidence with contextual causes through five stages: Evidence Gating extracts and refines emotional evidence across modalities; Cause Encoding captures causes from conversational context; Evidence-Cause Interaction uses attention to integrate evidence with diverse causes, generating rich candidate features for emotion inference; Feature Gating adaptively weights contributions of candidate features; and Emotion Classification classifies emotions. We evaluate ECERC on two widely used benchmark datasets, IEMOCAP and MELD. Experimental results show that ECERC achieves competitive performance in weighted F1-score and accuracy, demonstrating its effectiveness in MMERC.

  • Research Article
  • Citations: 57
  • 10.1016/j.cie.2022.108078
Multi-head attention fusion networks for multi-modal speech emotion recognition
  • Mar 10, 2022
  • Computers & Industrial Engineering
  • Junfeng Zhang + 4 more

  • Research Article
  • Citations: 2
  • 10.3390/s26030807
CV-EEGNet: A Compact Complex-Valued Convolutional Network for End-to-End EEG-Based Emotion Recognition
  • Jan 26, 2026
  • Sensors (Basel, Switzerland)
  • Wenhao Wang + 6 more

In electroencephalogram (EEG)-based emotion recognition tasks, existing end-to-end approaches predominantly rely on real-valued neural networks, which mainly operate in the time–amplitude domain. However, EEG signals are a type of wave, intrinsically including frequency, phase, and amplitude characteristics. Real-valued architectures may struggle to capture amplitude–phase coupling and spectral structures that are crucial for emotion decoding. To the best of our knowledge, this work is the first to introduce complex-valued neural networks for EEG-based emotion recognition, upon which we design a new end-to-end architecture named Complex-valued EEGNet (CV-EEGNet). Beginning with raw EEG signals, CV-EEGNet transforms them into complex-valued spectra via the Fast Fourier Transform, then sequentially applies complex-valued spectral, spatial, and depthwise-separable convolution modules to extract frequency structures, spatial topologies, and high-level semantic representations while preserving amplitude–phase relationships. Finally, a complex-valued, fully connected classifier generates complex logits, and the final emotion predictions are derived from their magnitudes. Experiments on the SEED (three-class) and SEED-IV (four-class) datasets validate the effectiveness of the proposed method, with t-SNE visualizations further confirming the discriminability of the learned representations. These results show the potential of complex-valued neural networks for raw-signal EEG emotion recognition.
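
Two steps named in the abstract above, turning a raw signal into a complex-valued spectrum and deriving the prediction from the magnitudes of complex logits, can be sketched in a few lines. This is an illustrative sketch only: it uses a naive O(n²) discrete Fourier transform (which computes the same spectrum an FFT would) and hypothetical function names, and it omits the complex-valued convolution modules entirely.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence.

    Returns one complex coefficient per frequency bin, so both amplitude
    (abs) and phase (cmath.phase) are preserved.
    """
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def predict_from_complex_logits(logits):
    """Pick the class whose complex logit has the largest magnitude."""
    magnitudes = [abs(z) for z in logits]
    return max(range(len(magnitudes)), key=magnitudes.__getitem__)


if __name__ == "__main__":
    # A toy 8-sample "EEG" segment: a pure cosine at frequency bin 2.
    segment = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
    spectrum = dft(segment)
    print([round(abs(c), 3) for c in spectrum])
    # Hypothetical complex logits for three emotion classes.
    print(predict_from_complex_logits([1 + 1j, -3 + 0j, 0.5j]))
```

Note that the class with logit `-3 + 0j` wins in the usage example even though its real part is negative, because only the magnitude is compared.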

  • Research Article
  • Citations: 18
  • 10.1016/j.bspc.2023.104661
Stochastic weight averaging enhanced temporal convolution network for EEG-based emotion recognition
  • Feb 7, 2023
  • Biomedical Signal Processing and Control
  • Lijun Yang + 3 more

  • Research Article
  • Citations: 24
  • 10.1109/tetci.2024.3406422
Comprehensive Multisource Learning Network for Cross-Subject Multimodal Emotion Recognition
  • Feb 1, 2025
  • IEEE Transactions on Emerging Topics in Computational Intelligence
  • Chuangquan Chen + 6 more

Electroencephalography (EEG) signals and eye movement signals, which represent internal physiological responses and external subconscious behaviors, respectively, have been shown to be reliable indicators for recognizing emotions. However, integrating these two modalities across multiple subjects presents several challenges: 1) designing a robust consistency metric that balances the consistency and divergences between heterogeneous modalities across multiple subjects; 2) simultaneously considering intra-modality and inter-modality information across multiple subjects; and 3) overcoming individual differences among multiple subjects and generating subject-invariant representations of the multimodal fused features. To address these challenges associated with multisource data (i.e., multiple modalities and subjects), we propose a novel comprehensive multisource learning network (CMSLNet) for cross-subject multimodal emotion recognition. Specifically, an instance-level adaptive robust consistency metric is first designed to better align the information between EEG signals and eye movement signals, identifying their consistency and divergences across various emotions. Subsequently, an attentive low-rank multimodal fusion (Att-LMF) method is developed to account for individual differences and dynamically learn intra-modality and inter-modality information, resulting in highly discriminative fused features. Finally, domain generalization is utilized to extract subject-invariant representations of the fused features, thus adapting to new subjects and enhancing the model's generalization. Through these elaborate designs, CMSLNet effectively incorporates the information from multisource data, thus significantly improving the accuracy and reliability of emotion recognition. Extensive experiments on two public datasets demonstrate the superior performance of CMSLNet. CMSLNet achieves high accuracies of 83.15% on the SEED-IV dataset and 87.32% on the SEED-V dataset, surpassing the state-of-the-art methods by 3.62% and 4.60%, respectively.

  • Research Article
  • Citations: 5
  • 10.3390/app12115436
Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation
  • May 27, 2022
  • Applied Sciences
  • Hongchao Ma + 4 more

Emotion Recognition in Conversation (ERC) aims to automatically recognize the emotion of each utterance in a conversation. Because conversational data are difficult to collect and label, large-scale corpora for this task are scarce, which makes the supervised training required by large-scale neural networks harder. Introducing a large-scale generative conversational dataset can assist with modeling dialogue. However, the spatial distribution of feature vectors in the source and target domains is inconsistent after introducing the external dataset. To alleviate the problem, we propose a Domain Adversarial Network for Cross-Domain Emotion Recognition in Conversation (DAN-CDERC) model, consisting of domain adversarial and emotion recognition models. The domain adversarial model consists of the encoders, a generator and a domain discriminator. First, the encoders and generator learn contextual features from a large-scale source dataset. The discriminator performs domain adaptation by discriminating the domain to make the feature space of the source and target domain consistent, so as to obtain domain invariant features. Then DAN-CDERC transfers the learned domain invariant dialogue context knowledge from the domain adversarial model to the emotion recognition model to assist in modeling the dialogue context. Due to the use of a domain adversarial network, DAN-CDERC obtains dialogue-level contextual information that is domain invariant, thereby reducing the negative impact of inconsistency in domain space. Empirical studies illustrate that the proposed model outperforms the baseline models on three benchmark emotion recognition datasets.

  • Research Article
  • 10.3390/s26092859
UDC-SNN: An Uncertainty-Aware Dynamic Cascading Framework with Spiking Neural Network for Balancing Performance and Energy in Multimodal Emotion Recognition
  • May 3, 2026
  • Sensors (Basel, Switzerland)
  • Guihao Ran + 5 more

The aim of this study is to propose an uncertainty-aware dynamic cascading framework based on spiking neural network (UDC-SNN) for multimodal emotion recognition, particularly to address the inherent trade-off between recognition performance and energy efficiency. An asymmetric dynamic routing mechanism was proposed to enable demand-driven activation of the high-power electroencephalogram (EEG) branch, coupled with preliminary inference on a low-power electrocardiogram (ECG) branch and uncertainty quantification via Shannon entropy. Meanwhile, a parameter-free log-linear aggregation strategy was developed to transform modality-specific entropy into dynamic Bayesian weights through an exponential decay function, effectively mitigating the negative transfer effects induced by unimodal noise. The UDC-SNN was evaluated on the multimodal affective dataset DREAMER, comprising 23 subjects (170,660 segments). The averaged recognition accuracy and energy consumption across the three dimensions of valence, arousal, and dominance were 90.75% and 4.62 J, respectively. The obtained results suggest that the proposed framework could potentially achieve a favorable balance between high emotion recognition and low energy consumption, thereby establishing its applicability for real-time monitoring in resource-constrained scenarios.
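
The routing and aggregation described in the abstract above can be sketched roughly as follows: run the low-power branch first, measure its Shannon entropy, and only invoke the high-power branch when the prediction is uncertain, then fuse the two distributions log-linearly with entropy-derived weights. This is a sketch under stated assumptions, not the published method: the weight form `exp(-H)`, the entropy threshold, and all function names are hypothetical, and plain probability lists stand in for the spiking network branches.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def log_linear_fuse(dists):
    """Log-linear pooling: p(c) is proportional to prod_m p_m(c) ** w_m.

    Assumed weighting (not confirmed by the source): w_m = exp(-H_m), so a
    more uncertain modality gets an exponentially smaller weight.
    """
    weights = [math.exp(-shannon_entropy(d)) for d in dists]
    n_classes = len(dists[0])
    log_scores = [
        sum(w * math.log(max(d[c], 1e-12)) for w, d in zip(weights, dists))
        for c in range(n_classes)
    ]
    # Softmax over the pooled log scores, shifted for numerical stability.
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(ecg_probs, run_eeg, threshold=0.5):
    """Return (probabilities, eeg_used).

    `run_eeg` is a zero-argument callable standing in for the expensive EEG
    branch; it is only called when the ECG entropy exceeds `threshold`.
    """
    if shannon_entropy(ecg_probs) <= threshold:
        return list(ecg_probs), False
    return log_linear_fuse([ecg_probs, run_eeg()]), True


if __name__ == "__main__":
    print(cascade_predict([0.95, 0.05], lambda: [0.1, 0.9]))  # ECG alone suffices
    print(cascade_predict([0.55, 0.45], lambda: [0.1, 0.9]))  # EEG branch is invoked
```

The energy saving in such a cascade comes from the early exit: every sample that the cheap branch classifies confidently never pays for the expensive branch.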

  • Conference Article
  • Citations: 416
  • 10.1145/2818346.2830596
Recurrent Neural Networks for Emotion Recognition in Video
  • Nov 9, 2015
  • Samira Ebrahimi Kahou + 4 more

Deep learning based approaches to facial analysis and video analysis have recently demonstrated high performance on a variety of key tasks such as face recognition, emotion recognition and activity recognition. In the case of video, information often must be aggregated across a variable length sequence of frames to produce a classification result. Prior work using convolutional neural networks (CNNs) for emotion recognition in video has relied on temporal averaging and pooling operations reminiscent of widely used approaches for the spatial aggregation of information. Recurrent neural networks (RNNs) have seen an explosion of recent interest as they yield state-of-the-art performance on a variety of sequence analysis tasks. RNNs provide an attractive framework for propagating information over a sequence using a continuous valued hidden layer representation. In this work we present a complete system for the 2015 Emotion Recognition in the Wild (EmotiW) Challenge. We focus our presentation and experimental analysis on a hybrid CNN-RNN architecture for facial expression analysis that can outperform a previously applied CNN approach using temporal averaging for aggregation.

  • Research Article
  • Citations: 48
  • 10.1016/j.specom.2023.103010
Multiscale-multichannel feature extraction and classification through one-dimensional convolutional neural network for Speech emotion recognition
  • Nov 22, 2023
  • Speech Communication
  • Minying Liu + 5 more

  • Research Article
  • Citations: 24
  • 10.1016/j.jneumeth.2022.109624
Spatial-frequency-temporal convolutional recurrent network for olfactory-enhanced EEG emotion recognition
  • May 16, 2022
  • Journal of neuroscience methods
  • Mengxia Xing + 3 more

  • Research Article
  • Citations: 1
  • 10.3390/brainsci15121343
Temporal Capsule Feature Network for Eye-Tracking Emotion Recognition
  • Dec 18, 2025
  • Brain Sciences
  • Qingfeng Gu + 5 more

Eye Tracking (ET) parameters, as physiological signals, are widely applied in emotion recognition and show promising performance. However, emotion recognition relying on ET parameters still faces several challenges: (1) insufficient extraction of temporal dynamic information from the ET parameters; (2) a lack of sophisticated features with strong emotional specificity, which restricts the model’s robustness and individual generalization capability. To address these issues, we propose a novel Temporal Capsule Feature Network (TCFN) for ET parameter-based emotion recognition. The network incorporates a Window Feature Module to extract Eye Movement temporal dynamic information and a specialized Capsule Network Module to mine complementary and collaborative relationships among features. The MLP Classification Module realizes feature-to-category conversion, and a Dual-Loss Mechanism is integrated to optimize overall performance. Experimental results demonstrate the superiority of the proposed model: the average accuracy reaches 83.27% for Arousal and 89.94% for Valence (three-class tasks) on the eSEE-d dataset, and the accuracy rate of four-category across-session emotion recognition is 63.85% on the SEED-IV dataset.

  • Research Article
  • 10.1007/s11571-025-10399-8
C2DGCN: cross-connected distributive learning-enabled graph convolutional network for human emotion recognition using electroencephalography signal.
  • Dec 26, 2025
  • Cognitive neurodynamics
  • Puja Cholke + 5 more

Emotion Recognition generally involves the identification of the present mental state or psychological conditions of the human while interacting with others. Among the various modalities, Electroencephalography is the most deceptive emotion recognition technique because of its ability to characterize brain activities accurately. Several emotion recognition methods have been designed utilizing Deep Learning approaches from EEG signals. Yet, their inability to capture the complex features and the occurrence of the overfitting problems with increased computational complexity affected their extensive application. Therefore, this research proposes the Cross-Connected Distributive Learning-enabled Graph Convolutional Network (C2DGCN) for effective emotion recognition. Specifically, the cross-connected distributive learning in the C2DGCN enables extensive feature sharing and integration, thus reducing the computation complexity and improving the accuracy. Further, the application of the Statistical Time-Frequency Signal descriptor aids in the extraction of complex features and mitigates the overfitting issue. The experimental validation revealed the effectiveness of the C2DGCN by achieving a high accuracy of 97.73%, sensitivity of 98.32%, specificity of 98.22%, and precision of 98.32% with 90% of training using the SEED-IV dataset. For the evaluation using the DEAP dataset, the proposed C2DGCN model reaches an accuracy of 97.66%, precision of 97.98%, sensitivity of 97.25%, and specificity of 98.07%.

  • Research Article
  • Citations: 55
  • 10.1016/j.knosys.2023.111199
Functional connectivity-enhanced feature-grouped attention network for cross-subject EEG emotion recognition
  • Nov 14, 2023
  • Knowledge-Based Systems
  • Wenhui Guo + 4 more
