Abstract

With the support of multi-head attention, the Transformer achieves remarkable results in speech emotion recognition. However, existing models still struggle to accurately locate the important regions of semantic information at different time scales. To address this problem, we propose a Transformer-based network for dynamic-static feature fusion, composed of a locally dynamic multi-head attention module and a global static attention module. The locally dynamic multi-head attention module adapts the attention window sizes and window centers of different regions through the speech samples and learnable parameters, enabling the model to adaptively discover and attend to valuable information embedded in speech. The global static attention module allows the model to make full use of every element in the sequence and to learn critical global feature information by establishing connections over the entire input sequence. We also train the model with a data-mixture training method and introduce a center loss function to supervise training, which accelerates convergence and alleviates the class-imbalance problem to some extent. The method achieves strong performance on the IEMOCAP and MELD datasets, demonstrating that the proposed model structure and training method offer improved accuracy and robustness.
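To make the locally dynamic attention idea concrete, the following is a minimal sketch (not the authors' implementation) of single-head attention with a learnable Gaussian locality window, where a window center and width are predicted per query position from its content; the paper's actual parameterization and multi-head layout may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyWindowedAttention(nn.Module):
    """Illustrative attention layer whose window center/width are learned per step."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Predict a window center (as a fraction of the sequence length) and a
        # width for every query position from its content (hypothetical design).
        self.center_proj = nn.Linear(d_model, 1)
        self.width_proj = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        B, T, D = x.shape
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Standard scaled dot-product scores: (B, T, T)
        scores = q @ k.transpose(-2, -1) / D ** 0.5

        # Learnable window: center in [0, T-1], width > 0, both per query step.
        center = torch.sigmoid(self.center_proj(x)) * (T - 1)   # (B, T, 1)
        width = F.softplus(self.width_proj(x)) + 1.0             # (B, T, 1)

        positions = torch.arange(T, device=x.device).float()     # (T,)
        # Gaussian penalty pulls attention weights toward the predicted window.
        penalty = ((positions - center) ** 2) / (2 * width ** 2) # (B, T, T)
        attn = torch.softmax(scores - penalty, dim=-1)

        return attn @ v

# Usage example:
# out = LocallyWindowedAttention(256)(torch.randn(2, 100, 256))  # (2, 100, 256)
```

The Gaussian penalty mirrors the general idea of content-dependent local windows; the global static attention module described in the abstract would instead use unpenalized scores over the full sequence.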
