Abstract

One of the main barriers to deploying speech emotion recognition systems in real applications is the lack of generalization of the emotion classifiers. The recognition performance achieved on controlled recordings drops when the models are tested with different speakers, channels, environments, and domain conditions. This paper explores supervised model adaptation, which can improve the performance of systems evaluated under mismatched training and testing conditions. We address the following key questions in the context of supervised adaptation for speech emotion recognition: (a) how much labeled data is needed for adaptation to achieve good performance? (b) how important is speaker diversity in the labeled set? (c) can spontaneous acted data provide similar performance to naturalistic non-acted recordings? and (d) what is the best approach to adapt the models (domain adaptation versus incremental/online training)? We investigate these questions using a multi-corpus framework in which the models are trained and tested with different databases. The results indicate that even a small portion of data used for adaptation can significantly improve performance. Increasing the speaker diversity of the labeled data used for adaptation does not provide a significant gain in performance. We also observe similar performance when the classifiers are trained with naturalistic non-acted data and with spontaneous acted data.
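
To make the idea of supervised adaptation concrete, the sketch below fine-tunes a pre-trained emotion classifier on a small labeled set drawn from the target domain, using a reduced learning rate so the source-domain knowledge is not overwritten. This is a minimal illustration, not the paper's method: the network architecture, the 88-dimensional feature vector (e.g., an eGeMAPS-style functional set), the four emotion classes, and the placeholder data are all assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical emotion classifier: acoustic feature vector -> emotion class.
# The 88-dim input and 4-class output are illustrative assumptions.
model = nn.Sequential(
    nn.Linear(88, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 4),
)
# In practice the model would be pre-trained on the source corpus, e.g.:
# model.load_state_dict(torch.load("source_domain_model.pt"))

# Small labeled adaptation set from the target domain
# (random placeholders stand in for real features and labels).
target_feats = torch.randn(200, 88)
target_labels = torch.randint(0, 4, (200,))
loader = DataLoader(TensorDataset(target_feats, target_labels),
                    batch_size=32, shuffle=True)

# Supervised adaptation: continue training on the labeled target data
# with a small learning rate to limit drift from the source model.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for feats, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(feats), labels)
        loss.backward()
        optimizer.step()
```

The same loop, applied batch by batch as new labeled target data arrives, would correspond to the incremental/online training variant the abstract contrasts with one-shot domain adaptation.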
