Abstract

Currently, emotional features for speech emotion recognition are typically extracted directly from speech; however, recognition accuracy can be affected by factors such as semantics, language, and differences across speech datasets. Achieving emotional judgments consistent with those of human listeners remains a key challenge for AI. Electroencephalography (EEG) signals are an effective means of capturing authentic and meaningful emotional information in humans, which makes EEG a promising tool for detecting the emotional cues conveyed in speech. In this study, we proposed a novel approach named CS-GAN that generates listener EEGs in response to a speaker's speech, specifically aimed at enhancing cross-subject emotion recognition. We utilized generative adversarial networks (GANs) to establish a mapping between speech and EEG and thereby generate stimulus-induced EEGs. Furthermore, we integrated compressive sensing (CS) theory into the GAN-based EEG generation method, enhancing the fidelity and diversity of the generated EEGs. The generated EEGs were then processed with a CNN-LSTM model to identify the emotional categories conveyed in the speech. By averaging these EEGs, we obtained event-related potentials (ERPs) that further improve the cross-subject capability of the method. The experimental results demonstrate that the EEGs generated by this method outperform real listener EEGs by 9.31% in cross-subject emotion recognition, and the ERPs yield an improvement of 43.59%, providing evidence for the effectiveness of the method in cross-subject emotion recognition.
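To make the downstream classification and ERP-averaging steps concrete, the following is a minimal illustrative sketch, not the authors' released code: a generic CNN-LSTM classifier over EEG segments and a simple trial-averaging function for forming ERPs. The channel count (62), segment length (400 samples), number of classes (4), and layer sizes are assumptions chosen for illustration and are not specified in the abstract.

```python
# Illustrative sketch of the classification stage: a CNN-LSTM over EEG
# segments plus trial averaging to form an ERP. All shapes and layer
# sizes are assumed, not taken from the paper.
import torch
import torch.nn as nn


class CNNLSTMClassifier(nn.Module):
    def __init__(self, n_channels=62, n_classes=4, hidden=128):
        super().__init__()
        # 1-D convolutions extract local temporal features per EEG channel bank
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # LSTM models longer-range temporal structure across the feature sequence
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, channels, time)
        feats = self.cnn(x)               # (batch, 128, time / 4)
        feats = feats.transpose(1, 2)     # (batch, time / 4, 128) for the LSTM
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])         # emotion-class logits


def average_to_erp(eeg_trials):
    """Average repeated (generated) EEG trials for one stimulus into an ERP."""
    # eeg_trials: (n_trials, channels, time) -> (channels, time)
    return eeg_trials.mean(dim=0)


if __name__ == "__main__":
    model = CNNLSTMClassifier()
    trials = torch.randn(20, 62, 400)     # 20 generated trials for one utterance (dummy data)
    erp = average_to_erp(trials)          # (62, 400)
    logits = model(erp.unsqueeze(0))      # classify the averaged response
    print(logits.shape)                   # torch.Size([1, 4])
```

In this sketch, classifying the averaged ERP rather than individual trials reflects the abstract's claim that averaging suppresses subject-specific variability and improves cross-subject generalization; the GAN-based speech-to-EEG generation stage itself is not shown.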
