Abstract

Researchers concerned with the automatic recognition of human emotion in speech have proposed a considerable variety of segmental and supra-segmental acoustic descriptors. These range from prosodic characteristics and voice quality to acoustic correlates of articulation, and represent unequal degrees of perceptual elaboration. Recently, first comparisons across multiple speech databases have provided evidence that spectral and cepstral characteristics might have the greatest potential for the task. Yet novel acoustic correlates are constantly proposed, as the question of the optimal representation remains disputed. Evaluating suggested correlates is non-trivial, as no agreed "standard" set and method of assessment exists, and inter-corpus substantiation is usually lacking. Such substantiation is particularly difficult owing to the divergence of models employed for the ground-truth description of emotion. To ease this challenge, the arousal-valence space is proposed as the predominant means of mapping information stemming from diverse speech resources, including acted and spontaneous speech with variable and fixed phonetic content, onto well-defined binary tasks. The acoustic baseline feature sets of all six past emotion and paralinguistics challenges are evaluated systematically on eight standard speech-emotion corpora in order to assess the power of each feature set for different types of data.
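
As a minimal illustration of the label-mapping idea described above, the sketch below binarizes categorical emotion labels into high/low arousal and positive/negative valence targets. The particular label-to-quadrant assignments are assumptions for demonstration only and are not taken from the paper; cross-corpus studies typically define such mappings per corpus.

```python
# Illustrative sketch (assumed mapping, not the paper's): projecting
# categorical emotion labels onto binary arousal/valence tasks so that
# corpora with divergent label sets can be evaluated jointly.

# Hypothetical placement of common categorical labels in the
# arousal-valence plane: (arousal, valence), each "high" or "low".
AV_MAP = {
    "anger":    ("high", "low"),
    "fear":     ("high", "low"),
    "joy":      ("high", "high"),
    "surprise": ("high", "high"),
    "sadness":  ("low",  "low"),
    "boredom":  ("low",  "low"),
    "neutral":  ("low",  "high"),
}

def to_binary_tasks(label: str) -> tuple[int, int]:
    """Return (arousal, valence) binary targets: 1 = high/positive, 0 = low/negative."""
    arousal, valence = AV_MAP[label]
    return int(arousal == "high"), int(valence == "high")

if __name__ == "__main__":
    for lab in ("anger", "sadness", "joy"):
        print(lab, "->", to_binary_tasks(lab))
```

With all corpora reduced to the same two binary tasks, any acoustic feature set can then be scored uniformly, which is what enables the systematic comparison across the eight corpora.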
