Abstract
Continuous, time-varying prediction of emotion from speech in terms of attributes (e.g., arousal) has received considerable attention in recent years. However, the variability introduced by factors unrelated to emotion, such as speaker and phonetic variability, has not yet been fully explored, even though it can lead to less reliable models and less accurate emotion predictions. In particular, although speaker variability has been shown to be a significant confounding factor in continuous emotion prediction systems, there remains a paucity of analyses of how speaker variability affects such systems and of which methods can compensate for it. This paper first formulates speaker variability systematically in terms of probability distributions in both the feature and model spaces, and quantifies its effect by comparing inter- and intra-speaker variability between speaker-dependent models. Second, two compensation techniques, based on partial least squares (PLS) dimensionality reduction and on feature mapping, are proposed. Finally, the effectiveness of the proposed techniques is validated on three databases, across which they show consistent improvements in arousal, valence and dominance prediction. Additional quantitative analysis reveals that the two techniques compensate for speaker variability in the feature and model spaces simultaneously.