Sequential Modeling by Leveraging Non-Uniform Distribution of Speech Emotion

Wei-Cheng Lin,Carlos Busso

doi:10.1109/taslp.2023.3244527

Abstract

The expression and perception of human emotions are not uniformly distributed over time. Therefore, tracking local changes of emotion within a segment can lead to better models for <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">speech emotion recognition</i> (SER), even when the task is to provide a sentence-level prediction of the emotional content. A challenge to exploring local emotional changes within a sentence is that most existing emotional corpora only provide sentence-level annotations (i.e., one label per sentence). This labeling approach is not appropriate for leveraging the dynamic emotional trends within a sentence. We propose a framework that splits a sentence into a fixed number of chunks, generating chunk-level emotional patterns. The approach relies on emotion rankers to unveil the emotional pattern within a sentence, creating continuous emotional curves. Our approach trains the sentence-level SER model with a <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">sequence-to-sequence</i> formulation by leveraging the retrieved emotional curves. The proposed method achieves the best <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">concordance correlation coefficient</i> (CCC) prediction performance for arousal (0.7120), valence (0.3125), and dominance (0.6324) on the MSP-Podcast corpus. In addition, we validate the approach with experiments on the IEMOCAP and MSP-IMPROV databases. We further compare the retrieved curves with time-continuous emotional traces. The evaluation demonstrates that these retrieved chunk-label curves can effectively capture emotional trends within a sentence, displaying a time-consistency property that is similar to time-continuous traces annotated by human listeners. The proposed SER model learns meaningful, complementary, local information that contributes to the improvement of sentence-level predictions of emotional attributes.

Full Text