Abstract

The most commonly adopted approaches to speech emotion recognition (SER) use magnitude-spectrum features and nonlinear Teager energy operator (TEO) based features, while information carried by the phase spectrum is often omitted. Speech processing researchers have frequently overlooked phase because of the signal processing difficulties it poses. We present a study of two phase-based feature types, relative phase shift (RPS) features and modified group delay features (MODGDF), which represent the phase structure of speech, in the task of emotional arousal recognition. The evaluation is performed on the CRISIS acted speech database, which allows us to recognize five levels of emotional arousal from speech. To exploit these features, we employ a deep neural network. The performance of the approaches based on these features is compared with a baseline platform that uses Mel frequency cepstral coefficients (MFCCs) and all-pole group delay (APGD) features. Combining the additional phase-based feature types with our baseline platform improved the overall performance of the system across the different levels of emotional arousal. These results confirm that combining phase and magnitude information improves the overall performance of such a system, and that combining different types of phase-based features brings a further gain.

