Emotional Speech Research Articles

In speech emotion recognition (SER), our research addresses the critical challenges of capturing and evaluating node information and their complex interrelationships within speech data. We introduce Skip Graph Convolutional and Graph Attention Network (SkipGCNGAT), an innovative model that combines the strengths of skip graph convolutional networks (SkipGCNs) and graph attention networks (GATs) to address these challenges. SkipGCN incorporates skip connections, enhancing the flow of information across the network and mitigating issues such as vanishing gradients, while also facilitating deeper representation learning. Meanwhile, the GAT in the model assigns dynamic attention weights to neighboring nodes, allowing SkipGCNGAT to focus on both the most relevant local and global interactions within the speech data. This enables the model to capture subtle and complex dependencies between speech segments, thus facilitating a more accurate interpretation of emotional content. It overcomes the limitations of previous single-layer graph models, which were unable to effectively represent these intricate relationships across time and in different speech contexts. Additionally, by introducing a pre-pooling SkipGCN combination technique, we further enhance the ability of the model to integrate multi-layer information before pooling, improving its capacity to capture both spatial and temporal features in speech. Furthermore, we rigorously evaluated SkipGCNGAT on the IEMOCAP and MSP-IMPROV datasets, two benchmark datasets in SER. The results demonstrated that SkipGCNGAT consistently achieved state-of-the-art performance. These findings highlight the effectiveness of the proposed model in accurately recognizing emotions in speech, offering valuable insights and a solid foundation for future research on capturing complex relationships within speech signals for emotion recognition.

In the conventional approach to speech emotion recognition (SER), the classifier is usually trained on acted emotional speech data to predict individual basic emotions. In this work, we extend the SER systems with the realistic assumption of the coexistence of multiple basic emotions in an utterance. We utilize the MSP-Podcast database for developing the SER system, which contains spontaneous speech utterances. From the primary and secondary emotion annotations of this database, we organize six prototypical and seven non-prototypical emotions. We then propose the mixed-emotion model and express the non-prototypical emotions as a linear combination of prototypical emotions. The combination weights of the mixed-emotion model are computed using a normalized dominance scale based algorithm inspired by the integration of basic emotion theory and dimensional emotion theory in human psychology. We first train a prototypical SER model using the ECAPA-TDNN architecture. The softmax predictions from this model serve as emotion profile inputs to the mixed-emotion model, which then predicts the non-prototypical emotions. Assuming the coexistence of multiple emotions, we only apply the utterances with uniform emotion profiles to the mixed-emotion model. The developed system continues with the conventional SER model if the emotion profile tends to delta function owing to the probable occurrence of a single prototypical emotion. We develop the proposed mixed-emotion model based SER framework using MFCC and wav2vec 2.0 extracted features. Further, we show that due to human perception variations, there exist prominent annotation variations in the non-prototypical emotion ground truths. To address that, we extend the supervised evaluation protocols in four different formulations that capture the subjective variability at different levels. The proposed system shows a best-case performance improvement of 7.10% and 8.39% over the conventional prototypical SER model for the MFCC and wav2vec 2.0 features, respectively.

Emotional Speech Research Articles

Related Topics

Articles published on Emotional Speech

Enhanced Speech Emotion Recognition Using Conditional-DCGAN-Based Data Augmentation

Graph Neural Network-Based Speech Emotion Recognition: A Fusion of Skip Graph Convolutional Networks and Graph Attention Networks

CNN-Based Models for Emotion and Sentiment Analysis Using Speech Data

Research on Speech Emotion Recognition Method Based on ResSE_CNN1D

Emotion Classification from Speech Waveform Using Machine Learning and Deep Learning Techniques

Classification of Infant Crying Sounds Using SE-ResNet-Transformer

ESERNet: Learning spectrogram structure relationship for effective speech emotion recognition with swin transformer in classroom discourse analysis

DSTM: A transformer-based model with dynamic-static feature fusion in speech emotion recognition

Graph-based multi-Feature fusion method for speech emotion recognition

Data augmentation using a 1D-CNN model with MFCC/MFMC features for speech emotion recognition

Multimodal speech emotion recognition optimization using genetic algorithm

Cross-feature fusion speech emotion recognition based on attention mask residual network and Wav2vec 2.0

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Improved ShuffleNet V2 network with attention for speech emotion recognition

Extending speech emotion recognition systems to non-prototypical emotions using mixed-emotion model

Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

CENN: Capsule-enhanced neural network with innovative metrics for robust speech emotion recognition

A Combined CNN Architecture for Speech Emotion Recognition.

Emotion Recognition Through Analysis of Speech – A Review

Fusion of PCA and ICA in Statistical Subset Analysis for Speech Emotion Recognition.

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Emotional Speech Research Articles

Related Topics

Articles published on Emotional Speech

Enhanced Speech Emotion Recognition Using Conditional-DCGAN-Based Data Augmentation

Graph Neural Network-Based Speech Emotion Recognition: A Fusion of Skip Graph Convolutional Networks and Graph Attention Networks

CNN-Based Models for Emotion and Sentiment Analysis Using Speech Data

Research on Speech Emotion Recognition Method Based on ResSE_CNN1D

Emotion Classification from Speech Waveform Using Machine Learning and Deep Learning Techniques

Classification of Infant Crying Sounds Using SE-ResNet-Transformer

ESERNet: Learning spectrogram structure relationship for effective speech emotion recognition with swin transformer in classroom discourse analysis

DSTM: A transformer-based model with dynamic-static feature fusion in speech emotion recognition

Graph-based multi-Feature fusion method for speech emotion recognition

Data augmentation using a 1D-CNN model with MFCC/MFMC features for speech emotion recognition

Multimodal speech emotion recognition optimization using genetic algorithm

Cross-feature fusion speech emotion recognition based on attention mask residual network and Wav2vec 2.0

MFGCN: Multimodal fusion graph convolutional network for speech emotion recognition

Improved ShuffleNet V2 network with attention for speech emotion recognition

Extending speech emotion recognition systems to non-prototypical emotions using mixed-emotion model

Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning

CENN: Capsule-enhanced neural network with innovative metrics for robust speech emotion recognition

A Combined CNN Architecture for Speech Emotion Recognition.

Emotion Recognition Through Analysis of Speech – A Review

Fusion of PCA and ICA in Statistical Subset Analysis for Speech Emotion Recognition.