Estimation of emotional arousal from speech with phase-based features

Igor Guoth, Sakhia Darjaa, Roman Jarina, Milan Rusko, Marian Trnka, Marian Ritomský

doi:10.1121/1.5036131

Abstract

The most commonly adopted approaches in speech emotion recognition (SER) rely on features based on the magnitude spectrum and the nonlinear Teager energy operator (TEO), while information carried by the phase spectrum is often omitted, largely because of the signal processing difficulties it poses. We present a study of two types of phase-based features, relative phase shift (RPS) features and modified group delay features (MODGDF), which represent the phase structure of speech, in the task of emotional arousal recognition. The evaluation is performed on the CRISIS acted speech database, which allows us to recognize five levels of emotional arousal from speech. To exploit these features, we employ a deep neural network. The performance of the approaches based on these features is compared to a baseline platform using Mel frequency cepstral coefficients (MFCCs) and all-pole group delay features (APGD). Combining the additional phase-based feature types with the baseline platform improved the overall performance of the system across the different levels of emotional arousal. These results confirm that combining phase information with magnitude information improves the overall performance of such a system, and that combining different types of features representing phase information brings a further gain in performance.
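
For readers unfamiliar with group delay features, the following is a minimal sketch of how a modified group delay function can be computed for a single speech frame. It follows the commonly cited formulation in which the group delay numerator is divided by a cepstrally smoothed magnitude spectrum and then compressed. It is not the authors' implementation: the function name, the default values of alpha, gamma and lifter, and the FFT size are all illustrative assumptions, and the step that usually turns this function into MODGDF feature vectors (typically a discrete cosine transform) is omitted.

```python
import numpy as np


def modified_group_delay(frame, n_fft=512, alpha=0.4, gamma=0.9, lifter=8):
    """Illustrative modified group delay function of one frame.

    All parameter defaults are assumptions chosen for demonstration,
    not values taken from the paper.
    """
    frame = np.asarray(frame, dtype=float)
    n = np.arange(len(frame))

    # Spectra of x[n] and n * x[n]; together they give the group delay
    # without explicit phase unwrapping.
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)

    # Cepstrally smoothed magnitude spectrum: keep only the low-quefrency
    # cepstral coefficients (symmetric window), then return to the
    # spectral domain.
    log_mag = np.log(np.abs(X) + 1e-10)
    cepstrum = np.fft.irfft(log_mag, n_fft)
    window = np.zeros(n_fft)
    window[:lifter] = 1.0
    if lifter > 1:
        window[-(lifter - 1):] = 1.0
    smoothed = np.exp(np.fft.rfft(cepstrum * window).real)

    # Group delay numerator divided by the smoothed spectrum raised to
    # 2 * gamma, followed by sign-preserving compression with alpha.
    tau = (X.real * Y.real + X.imag * Y.imag) / (smoothed ** (2.0 * gamma))
    return np.sign(tau) * np.abs(tau) ** alpha


# Sanity check: a unit impulse delayed by 5 samples has a constant
# group delay of 5 samples when alpha = gamma = 1.
impulse = np.zeros(64)
impulse[5] = 1.0
delay = modified_group_delay(impulse, alpha=1.0, gamma=1.0)
```

With alpha and gamma set to 1 and a flat magnitude spectrum, the function reduces to the ordinary group delay, which is why the delayed impulse is a convenient check. Values of alpha and gamma below 1 are typically used to reduce the spikes that the plain group delay function exhibits near spectral zeros.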

Full Text