With the support of multi-head attention, the Transformer has shown remarkable results in speech emotion recognition. However, existing models still struggle to accurately locate emotionally important regions in semantic information at different time scales. To address this problem, we propose a Transformer-based network for dynamic-static feature fusion, composed of a locally dynamic multi-head attention module and a global static attention module. The locally dynamic multi-head attention module adapts the attention window sizes and window centers of different regions to the speech samples through learnable parameters, enabling the model to adaptively discover and attend to valuable information embedded in speech. The global static attention module enables the model to fully exploit every element of the sequence and learn critical global features by establishing connections across the entire input sequence. We also adopt a data-mixture training strategy and introduce the center loss function to supervise training, which accelerates model convergence and alleviates the sample-imbalance problem to a certain extent. The proposed method achieves good performance on the IEMOCAP and MELD datasets, demonstrating that the proposed model structure and training method offer improved accuracy and robustness.
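To make the adaptive-window idea concrete, the following is a minimal sketch of a single attention head whose window center and width are predicted from each query vector and applied as a Gaussian-shaped soft mask over the attention logits. The class name `LocalDynamicAttention`, the Gaussian soft-window parameterization, and the per-query center/width heads are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch only: query-conditioned "soft window" attention.
# The window center and width for each query position are predicted from the
# query vector via learnable linear layers; the paper's parameterization may differ.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalDynamicAttention(nn.Module):
    """Single-head attention whose receptive window adapts per query."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Predict a window-center offset and a window width from each query.
        self.center_head = nn.Linear(d_model, 1)
        self.width_head = nn.Linear(d_model, 1)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Standard scaled dot-product attention logits.
        logits = torch.matmul(q, k.transpose(-1, -2)) * self.scale   # (B, T, T)

        # Per-query window center (query position + learned offset) and width.
        positions = torch.arange(T, device=x.device, dtype=x.dtype)
        center = positions.view(1, T, 1) + self.center_head(q)       # (B, T, 1)
        width = F.softplus(self.width_head(q)) + 1.0                 # (B, T, 1), > 0

        # Gaussian-shaped bias: positions far from the window center are suppressed.
        dist = positions.view(1, 1, T) - center                      # (B, T, T)
        window_bias = -(dist ** 2) / (2.0 * width ** 2)

        attn = torch.softmax(logits + window_bias, dim=-1)
        return torch.matmul(attn, v)


if __name__ == "__main__":
    x = torch.randn(2, 100, 64)   # e.g. 100 frames of 64-dim acoustic features
    out = LocalDynamicAttention(64)(x)
    print(out.shape)              # torch.Size([2, 100, 64])
```

In this reading, the global static attention module would correspond to the unwindowed logits alone (full-sequence connections), while the soft window restricts each query to a learned local region; the actual fusion of the two modules in the paper may be implemented differently.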