Abstract

Speech emotion recognition (SER) is a crucial yet challenging task in affective computing due to the intricacy and variability inherent in speech. In this paper, we propose a novel method, MPAF-CNN, which combines a convolutional neural network (CNN)-based multiperspective aware module (MPAM) with a frame-level fine-grained fusion strategy (FFS) for SER using speech information. MPAM perceives the emotional information embedded in speech from three main perspectives: local, frame-level, and global. From the local perspective, the module adopts a multiscale design to perceive multi-granular emotional information under different local receptive fields. From the frame-level perspective, a novel frame-level aggregated attention is proposed to learn the intrinsic emotional associations of intermediate features, strengthen the model's attention to emotionally informative frames, and improve the emotional expressiveness of the intermediate features. From the global perspective, multiple layers of global intermediate features are aggregated along the time, frequency, or channel dimension to enhance the model's ability to extract and express global feature information. In addition, a new frame-level fine-grained fusion strategy employs an attention mechanism to model the interaction of emotional representations from different acoustic features at the frame level, capturing their underlying relationships and further improving the overall performance of the model. Experimental results show that our method performs strongly on speech emotion recognition, with MPAF-CNN achieving 72.19% and 72.88% recognition accuracy on the IEMOCAP database.
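To make the two core ideas of the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of (a) a multiscale local block that perceives emotional cues under different receptive fields and (b) an attention-based frame-level fusion of two acoustic feature streams. All layer choices, kernel sizes, and dimensions are assumptions for illustration and are not the authors' published MPAF-CNN configuration.

```python
import torch
import torch.nn as nn


class MultiScaleLocalBlock(nn.Module):
    """Illustrative multiscale convolution: parallel 1-D kernels of different
    sizes capture multi-granular emotional information (assumed kernel sizes)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2)
            for k in (3, 5, 7)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feature bins, frames); concatenate the multiscale outputs.
        return torch.cat([branch(x) for branch in self.branches], dim=1)


class FrameLevelFusion(nn.Module):
    """Illustrative attention-based fusion of two frame-level emotion
    representations; again a sketch, not the paper's exact FFS."""

    def __init__(self, dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # Frames of stream A attend to frames of stream B, then the original
        # and attended representations are concatenated and projected.
        attended, _ = self.cross_attn(feat_a, feat_b, feat_b)
        return self.proj(torch.cat([feat_a, attended], dim=-1))


# Toy usage with random tensors; shapes are illustrative only.
spec = torch.randn(2, 40, 200)                  # (batch, feature bins, frames)
local = MultiScaleLocalBlock(40, 64)(spec)      # -> (2, 192, 200)
a = torch.randn(2, 200, 128)                    # frames from one acoustic feature
b = torch.randn(2, 200, 128)                    # frames from another acoustic feature
fused = FrameLevelFusion()(a, b)                # -> (2, 200, 128)
print(local.shape, fused.shape)
```

The cross-attention step is one plausible way to model frame-level interactions between different acoustic features; the paper's actual frame-level aggregated attention and fusion may differ in structure and detail.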
