Abstract

Traditional discrete-time Speech Emotion Recognition (SER) modelling techniques typically assume that an entire speaker chunk or turn is indicative of its corresponding label. An alternative approach is to assume that emotional saliency varies over the course of a speaker turn and to use modelling techniques capable of identifying and exploiting the most emotionally salient segments, such as those with higher emotional intensity. This strategy has the potential to improve the accuracy of SER systems. Towards this goal, we developed a novel hierarchical recurrent neural network model that produces turn-level embeddings for SER. Specifically, we apply two levels of attention: one learns to identify the salient emotional words in a turn, and the other the more informative frames within those words. In a set of experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, we demonstrate that component-attention is more effective within our hierarchical framework than both standard soft-attention and conventional local-attention. Our best network, a hierarchical component-attention network with an attention scope of seven, achieved an Unweighted Average Recall (UAR) of 65.0% and a Weighted Average Recall (WAR) of 66.1%, outperforming the baseline attention approaches on the IEMOCAP database.
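To make the two-level attention structure concrete, the following is a minimal PyTorch-style sketch of a hierarchical recurrent network with standard soft-attention applied first over the frames within each word and then over the words within a turn. It is an illustration under stated assumptions, not a reproduction of the paper's method: the component-attention variant, the attention-scope parameter, the word segmentation, and all layer sizes and names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Standard soft-attention pooling over a sequence of hidden states."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1, bias=False)

    def forward(self, h):                            # h: (batch, steps, dim)
        scores = self.scorer(h).squeeze(-1)          # (batch, steps)
        weights = F.softmax(scores, dim=-1)          # attention distribution
        return torch.sum(weights.unsqueeze(-1) * h, dim=1)  # (batch, dim)

class HierarchicalAttentionSER(nn.Module):
    """Frame-level BiGRU + attention within each word, then word-level
    BiGRU + attention across words, yielding one turn-level embedding.
    Dimensions and segmentation are illustrative assumptions."""
    def __init__(self, n_feats=40, hid=64, n_classes=4):
        super().__init__()
        self.frame_rnn = nn.GRU(n_feats, hid, batch_first=True, bidirectional=True)
        self.frame_att = SoftAttention(2 * hid)
        self.word_rnn = nn.GRU(2 * hid, hid, batch_first=True, bidirectional=True)
        self.word_att = SoftAttention(2 * hid)
        self.classifier = nn.Linear(2 * hid, n_classes)

    def forward(self, x):  # x: (batch, n_words, n_frames, n_feats)
        b, w, f, d = x.shape
        frames = x.reshape(b * w, f, d)              # fold words into the batch
        frame_h, _ = self.frame_rnn(frames)          # (b*w, f, 2*hid)
        word_vecs = self.frame_att(frame_h)          # attend over frames per word
        word_vecs = word_vecs.reshape(b, w, -1)      # (b, w, 2*hid)
        word_h, _ = self.word_rnn(word_vecs)         # (b, w, 2*hid)
        turn_emb = self.word_att(word_h)             # attend over words per turn
        return self.classifier(turn_emb)             # emotion-class logits

# Usage: 2 turns, 5 words each, 30 frames per word, 40-dim acoustic features.
logits = HierarchicalAttentionSER()(torch.randn(2, 5, 30, 40))
print(logits.shape)  # torch.Size([2, 4])
```

The key design point the sketch shows is the folding of the word axis into the batch axis, so one frame-level encoder is shared across all words before the word-level encoder pools the turn.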
