Abstract

Continuous, time-varying prediction of emotion from speech in terms of attributes (e.g., arousal) has received considerable attention in the past few years. However, the variability introduced by factors unrelated to emotion, such as speaker and phonetic variability, can lead to less reliable models and less accurate emotion predictions, and it has not yet been fully explored. In particular, even though speaker variability has been shown to be a significant confounding factor in continuous emotion prediction systems, there remains a paucity of analyses of how speaker variability affects such systems and which methods can compensate for it. This paper first formulates speaker variability systematically in terms of probability distributions in both the feature and model spaces, and quantifies its effect by comparing inter- and intra-speaker variability across speaker-dependent models. Second, two compensation techniques are proposed, based on partial least squares dimensionality reduction and on feature mapping. Finally, the effectiveness of the proposed techniques is validated on three databases, across which they show consistent improvement in arousal, valence and dominance prediction. Additional quantitative analysis reveals that the two proposed techniques compensate for speaker variability in the feature and model spaces simultaneously.
