Abstract

Researchers concerned with the automatic recognition of human emotion in speech have proposed a considerable variety of segmental and supra-segmental acoustic descriptors. These range from prosodic characteristics to voice quality to acoustic correlates of articulation, and they reflect differing degrees of perceptual elaboration. Recently, first comparisons across multiple speech databases have provided evidence that spectral and cepstral characteristics have the greatest potential for the task [B. Schuller et al., Linguistic Insights 97, 285–307 (2009)]. Yet novel acoustic correlates are constantly proposed, as the question of the optimal representation remains disputed. Evaluating suggested correlates is non-trivial, as no agreed “standard” set and method of assessment exists, and inter-corpus substantiation is usually lacking. Such substantiation is particularly difficult owing to the divergence of models employed for the ground-truth description of emotion. To ease this challenge, it is proposed to use the potency-arousal-valence space as the predominant means of mapping information from diverse speech resources, including acted and spontaneous speech with both variable and fixed phonetic content, onto well-defined binary tasks. Among the various options for automatic classification, a method combining static and dynamic features representing pitch, intensity, duration, voice quality, and cepstral attributes is recommended.
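
The following minimal Python sketch illustrates the kind of pipeline the recommendation implies; it is not the exact feature set or classifier of the study. It assumes librosa for acoustic analysis and scikit-learn for classification, and the helper load_corpus is hypothetical. Frame-level ("dynamic") contours for pitch, intensity, and cepstral attributes are summarized by statistical functionals into fixed-length ("static") vectors, which then feed a binary classifier for one dimension of the emotion space.

```python
import numpy as np
import librosa
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def utterance_features(path, sr=16000):
    """Summarize one utterance as a fixed-length ("static") feature vector."""
    y, sr = librosa.load(path, sr=sr)

    # Frame-level ("dynamic") contours:
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # pitch (prosody)
    rms = librosa.feature.rms(y=y)[0]                     # intensity proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # cepstral attributes
    d_mfcc = librosa.feature.delta(mfcc)                  # first-order dynamics

    # Align frame counts (they may differ by a frame depending on padding).
    n = min(len(f0), len(rms), mfcc.shape[1])
    frames = np.vstack([f0[None, :n], rms[None, :n], mfcc[:, :n], d_mfcc[:, :n]])

    # Statistical functionals turn variable-length contours into static features.
    # Voice-quality (e.g. jitter, shimmer) and duration descriptors, also named
    # in the abstract, are omitted here for brevity.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])


# Hypothetical usage on a binary task (e.g. high vs. low arousal) obtained by
# mapping a corpus' original emotion labels into the dimensional space:
# paths, labels = load_corpus(...)   # assumed helper, not part of any library
# X = np.array([utterance_features(p) for p in paths])
# clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
# clf.fit(X, labels)
```

Mapping heterogeneous corpus labels onto such binary dimensional tasks is what allows results to be compared across acted and spontaneous resources with different annotation schemes.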
