Abstract
Emotion is expressed in both speech and song. Previous work has found that although spoken and sung emotion recognition are different tasks, they are related, and classifiers that explicitly exploit this relatedness can outperform classifiers that do not. Further, research in speech emotion recognition has demonstrated that emotion is modeled more accurately when gender is taken into account. However, it is not yet clear how domain (speech or song) and gender can be jointly leveraged in emotion recognition systems, nor how systems leveraging this information perform in cross-corpus settings. In this paper, we explore a multi-task emotion recognition framework and compare performance across different classification models and output selection/fusion methods using cross-corpus evaluation. Our results show that classification accuracy is highest when information is shared only between closely related tasks and when the outputs of disparate models are fused.
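The sketch below is a minimal illustration (not the paper's implementation) of the multi-task setup the abstract describes: a shared representation feeds separate heads for each (domain, gender) task, and the head outputs can be combined by late fusion. The feature dimension, task names, number of emotion classes, and fusion-by-averaging are all assumptions made for illustration.

```python
# Illustrative sketch only: multi-task emotion classification with a shared
# trunk, per-(domain, gender) heads, and simple late fusion of head outputs.
import torch
import torch.nn as nn

TASKS = ["speech_female", "speech_male", "song_female", "song_male"]  # assumed task split

class MultiTaskEmotionNet(nn.Module):
    def __init__(self, feat_dim=88, hidden_dim=64, n_emotions=4):
        super().__init__()
        # Representation shared across all domain/gender tasks.
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        # One output head per (domain, gender) task.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(hidden_dim, n_emotions) for t in TASKS}
        )

    def forward(self, x, task=None):
        h = self.shared(x)
        if task is not None:
            return self.heads[task](h)  # task-specific logits
        # Late fusion: average posteriors from all task heads.
        probs = [self.heads[t](h).softmax(dim=-1) for t in TASKS]
        return torch.stack(probs, dim=0).mean(dim=0)

# Usage example with random 88-dimensional acoustic feature vectors
# (an eGeMAPS-style dimensionality is assumed here).
model = MultiTaskEmotionNet()
features = torch.randn(8, 88)
fused_posteriors = model(features)                 # fused across all heads
speech_f_logits = model(features, task="speech_female")  # single-task output
```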