Abstract

Traditional discrete-time Speech Emotion Recognition (SER) modelling techniques typically assume that an entire speaker chunk or turn is indicative of its corresponding label. An alternative approach is to assume that emotional saliency varies over the course of a speaker turn and to use modelling techniques capable of identifying and exploiting the most emotionally salient segments, such as those with higher emotional intensity. This strategy has the potential to improve the accuracy of SER systems. Towards this goal, we developed a novel hierarchical recurrent neural network model that produces turn-level embeddings for SER. Specifically, we apply two levels of attention: one learns to identify the salient emotional words in a turn, and the other the more informative frames within those words. In a set of experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, we demonstrate that component-attention is more effective within our hierarchical framework than both standard soft-attention and conventional local-attention. Our best network, a hierarchical component-attention network with an attention scope of seven, achieved an Unweighted Average Recall (UAR) of 65.0% and a Weighted Average Recall (WAR) of 66.1%, outperforming the baseline attention approaches on the IEMOCAP database.
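To make the two-level attention structure concrete, the following is a minimal PyTorch-style sketch of a hierarchical recurrent network with standard soft-attention applied first over the frames within each word and then over the words within a turn. It is an illustration under stated assumptions, not a reproduction of the paper's method: the component-attention variant, the attention-scope parameter, the word segmentation, and all layer sizes and names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Standard soft-attention pooling over a sequence of hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1, bias=False)

    def forward(self, h):                            # h: (batch, steps, dim)
        scores = self.scorer(h).squeeze(-1)          # (batch, steps)
        weights = F.softmax(scores, dim=-1)          # attention distribution
        return torch.sum(weights.unsqueeze(-1) * h, dim=1)  # (batch, dim)

class HierarchicalAttentionSER(nn.Module):
    """Frame-level BiGRU + attention within each word, then word-level
    BiGRU + attention across words, yielding one turn-level embedding.
    Dimensions and segmentation are illustrative assumptions."""
    def __init__(self, n_feats=40, hid=64, n_classes=4):
        super().__init__()
        self.frame_rnn = nn.GRU(n_feats, hid, batch_first=True, bidirectional=True)
        self.frame_att = SoftAttention(2 * hid)
        self.word_rnn = nn.GRU(2 * hid, hid, batch_first=True, bidirectional=True)
        self.word_att = SoftAttention(2 * hid)
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, x):  # x: (batch, n_words, n_frames, n_feats)
        b, w, f, d = x.shape
        frames = x.reshape(b * w, f, d)              # fold words into the batch
        frame_h, _ = self.frame_rnn(frames)          # (b*w, f, 2*hid)
        word_vecs = self.frame_att(frame_h)          # attend over frames per word
        word_vecs = word_vecs.reshape(b, w, -1)      # (b, w, 2*hid)
        word_h, _ = self.word_rnn(word_vecs)         # (b, w, 2*hid)
        turn_emb = self.word_att(word_h)             # attend over words per turn
        return self.classifier(turn_emb)             # emotion-class logits

# Usage: 2 turns, 5 words each, 30 frames per word, 40-dim acoustic features.
logits = HierarchicalAttentionSER()(torch.randn(2, 5, 30, 40))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point the sketch shows is the folding of the word axis into the batch axis, so one frame-level encoder is shared across all words before the word-level encoder pools the turn.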
