Abstract

Although Speech Emotion Recognition (SER) has become an active research area, state-of-the-art performance is limited by the scarcity of emotional datasets. Multi-Task Learning (MTL), which jointly trains multiple related tasks, can exploit more emotion-related information within limited data, enhancing SER performance but also increasing training difficulty. In this paper, we propose a primary-task-driven adaptive loss function, named PTDA-Loss, that adaptively adjusts the task weights in the loss function to allocate learning resources more reasonably during training and further improve SER performance. We observe that excessive attention to the auxiliary task can leave insufficient learning resources for SER, hindering improvement on the primary task. Motivated by this finding, PTDA-Loss keeps SER dominant throughout training, while the learning resources for the auxiliary task are compressed appropriately but not excessively, ensuring that useful information from the auxiliary task is still learned effectively. Furthermore, to improve the model's learning ability, we employ several dynamic convolutional layers as the backbone network. Taking gender recognition and speaker recognition as the auxiliary task in turn, ablation experiments on the IEMOCAP, MSP-IMPROV, and MELD datasets confirm the rationality and effectiveness of PTDA-Loss, and comparative experiments demonstrate that our method achieves state-of-the-art performance on all three datasets, further improving SER accuracy.
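The abstract does not give the PTDA-Loss formula, but the idea of a primary-task-driven weighted multi-task loss can be sketched as follows. This is a hypothetical illustration under our own assumptions (the function names `ptda_weight` and `total_loss`, the floor/cap parameters, and the ratio-based rule are ours), not the authors' actual PTDA-Loss: the auxiliary weight shrinks when the auxiliary loss starts to dominate, but is floored so the auxiliary task is compressed rather than eliminated.

```python
def ptda_weight(primary_loss, aux_loss, floor=0.1, cap=1.0):
    """Hypothetical adaptive weight for the auxiliary task.

    The weight falls as the auxiliary loss grows relative to the primary
    (SER) loss, keeping SER dominant, but is clamped to [floor, cap] so
    the auxiliary task is never fully suppressed.
    """
    ratio = primary_loss / (primary_loss + aux_loss)
    return min(cap, max(floor, ratio))


def total_loss(primary_loss, aux_loss):
    """Combined MTL loss: primary task at full weight, auxiliary task
    down-weighted by the adaptive factor."""
    w = ptda_weight(primary_loss, aux_loss)
    return primary_loss + w * aux_loss
```

With equal losses the auxiliary task receives half weight, and when the auxiliary loss dwarfs the primary loss the weight bottoms out at the floor, so SER keeps most of the learning resources throughout training.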
