In speech emotion recognition (SER), our research addresses the critical challenges of capturing and evaluating node-level information and the complex interrelationships among nodes within speech data. We introduce the Skip Graph Convolutional and Graph Attention Network (SkipGCNGAT), a model that combines the strengths of skip graph convolutional networks (SkipGCNs) and graph attention networks (GATs) to address these challenges. The SkipGCN component incorporates skip connections, which enhance information flow across the network, mitigate vanishing gradients, and facilitate deeper representation learning. Meanwhile, the GAT component assigns dynamic attention weights to neighboring nodes, allowing SkipGCNGAT to focus on the most relevant local and global interactions within the speech data. The model can therefore capture subtle and complex dependencies between speech segments, enabling a more accurate interpretation of emotional content, and it overcomes the limitations of earlier single-layer graph models, which could not effectively represent these intricate relationships across time and across speech contexts. Additionally, a pre-pooling SkipGCN combination technique further enhances the model's ability to integrate multi-layer information before pooling, improving its capacity to capture both spatial and temporal features in speech. We rigorously evaluated SkipGCNGAT on IEMOCAP and MSP-IMPROV, two benchmark datasets in SER, where it consistently achieved state-of-the-art performance. These findings demonstrate the effectiveness of the proposed model in accurately recognizing emotions in speech and offer a solid foundation for future research on modeling complex relationships within speech signals for emotion recognition.
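The two building blocks named above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: the layer sizes, the ReLU/LeakyReLU nonlinearities, the additive form of the skip connection, and the single-head attention are all generic choices from the GCN/GAT literature, and the toy adjacency matrix standing in for a speech-segment graph is invented for the example.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def skip_gcn_layer(A_norm, H, W):
    """One GCN layer with an additive skip (identity) connection.
    The identity path lets information and gradients bypass the graph
    convolution, the mechanism the abstract credits for supporting
    deeper representation learning."""
    return np.maximum(A_norm @ H @ W, 0.0) + H  # ReLU(conv) + skip

def gat_layer(A, H, W, a):
    """Single-head graph attention: a shared vector `a` scores each edge,
    and scores are softmax-normalized over each node's neighborhood
    (self-loops included), yielding dynamic per-neighbor weights."""
    n = A.shape[0]
    HW = H @ W
    mask = (A + np.eye(n)) > 0
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if mask[i, j]:
                e = a @ np.concatenate([HW[i], HW[j]])
                scores[i, j] = e if e > 0 else 0.2 * e  # LeakyReLU
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)
    return alpha @ HW

rng = np.random.default_rng(0)
# Toy 4-node graph standing in for connected speech segments (hypothetical).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H = rng.standard_normal((4, 8))          # node (segment) features
W = rng.standard_normal((8, 8)) * 0.1    # square weight so the skip shapes match
a = rng.standard_normal(16) * 0.1        # attention vector over concatenated pairs

H1 = skip_gcn_layer(normalize_adj(A), H, W)  # skip-GCN stage
H2 = gat_layer(A, H1, W, a)                  # attention stage
print(H2.shape)  # (4, 8)
```

Stacking several such skip-GCN layers and combining their outputs before a graph-level pooling step would correspond, in spirit, to the pre-pooling combination technique the abstract describes; the paper itself should be consulted for the exact formulation.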