Abstract

Zero-shot speech emotion recognition (SER) enables machines to sense emotional states unseen during training, in contrast to conventional SER, which operates in a fully supervised setting. To address the zero-shot SER task, auditory affective descriptors (AADs) are typically employed to transfer affective knowledge from seen to unseen emotional states. However, it remains unclear which types of AADs best describe emotional states in speech during this transfer. We therefore define and investigate three types of AADs, namely per-emotion semantic-embedding, per-emotion manually annotated, and per-sample manually annotated AADs, in the context of zero-shot emotion recognition in speech. This leads to a systematic design comprising prototype- and annotation-based zero-shot SER modules, which take per-emotion and per-sample AADs as input, respectively. We then perform extensive experimental comparisons between human- and machine-derived AADs on the French emotional speech corpus CINEMO for positive-negative (PN) and within-negative (WN) tasks. The results indicate that semantic-embedding prototypes obtained from pretrained models can outperform manually annotated emotional dimensions in zero-shot SER. They further suggest that, with the help of sufficiently powerful pretrained models, machines can understand and describe affective information in speech better than human beings.
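To make the prototype-based setting concrete, the following minimal Python/NumPy sketch illustrates nearest-prototype zero-shot classification with per-emotion semantic-embedding AADs: each speech embedding, assumed to be already projected into the same semantic space as the emotion-word embeddings, is assigned to the emotion whose prototype it is most cosine-similar to. The function names, the projection assumption, and the toy data are hypothetical illustrations, not the paper's implementation.

    import numpy as np

    def l2_normalize(x, eps=1e-8):
        """Scale rows to unit length so dot products equal cosine similarity."""
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

    def zero_shot_predict(speech_embeddings, emotion_prototypes, emotion_names):
        """Assign each speech embedding to its nearest per-emotion prototype.

        speech_embeddings  : (n_samples, d) acoustic representations, assumed
                             already projected into the shared semantic space.
        emotion_prototypes : (n_emotions, d) per-emotion AADs, e.g. word
                             embeddings of the unseen emotion labels obtained
                             from a pretrained model.
        """
        s = l2_normalize(np.asarray(speech_embeddings, dtype=float))
        p = l2_normalize(np.asarray(emotion_prototypes, dtype=float))
        similarity = s @ p.T            # (n_samples, n_emotions) cosine scores
        nearest = similarity.argmax(axis=1)
        return [emotion_names[i] for i in nearest]

    # Toy usage: two samples placed near prototypes 2 and 0, respectively.
    rng = np.random.default_rng(0)
    prototypes = rng.normal(size=(3, 4))
    samples = prototypes[[2, 0]] + 0.1 * rng.normal(size=(2, 4))
    print(zero_shot_predict(samples, prototypes, ["anger", "fear", "sadness"]))
    # expected: ['sadness', 'anger']

Because the prototypes come from label words rather than labeled audio, the same classifier can score emotions never seen in training, which is the core of the prototype-based zero-shot module described above.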
