Abstract

The classification of emotional speech is a central topic in speech-related research on human-computer interaction (HCI). This paper presents a novel feature extraction method based on multi-resolution texture image information (MRTII). The MRTII feature set is derived from multi-resolution texture analysis of the speech spectrogram for characterizing and classifying different emotions in a speech signal. The motivation is that emotions exhibit different intensity values in different frequency bands. From the perspective of human visual perception, the texture properties of a multi-resolution emotional speech spectrogram should form a good feature set for emotion classification in speech; moreover, multi-resolution texture analysis discriminates between emotions more clearly than uniform-resolution texture analysis. To achieve high accuracy of emotional discrimination, especially in real-life conditions, an acoustic activity detection (AAD) algorithm is incorporated into the MRTII-based feature extraction. Because many blended emotions occur in real life, this paper makes use of two corpora of naturally occurring dialogs recorded in real-life call centers. Compared with the traditional Mel-scale Frequency Cepstral Coefficients (MFCC) and state-of-the-art features, the MRTII features improve the correct classification rates of the proposed systems across databases in different languages. Experimental results show that the proposed MRTII-based features, inspired by human visual perception of the spectrogram image, provide significant classification performance for real-life emotion recognition in speech.
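
As a rough illustration of the first stage of the pipeline described above, the following Python sketch converts a speech signal into a gray-level spectrogram image. This is not the paper's exact implementation; the sampling rate, window parameters, and 8-bit quantization are illustrative assumptions.

```python
import numpy as np
from scipy.signal import spectrogram

def gray_level_spectrogram(signal, fs=16000, win_len=512, overlap=256):
    """Log-magnitude spectrogram quantized to an 8-bit gray-level image."""
    _, _, sxx = spectrogram(signal, fs=fs, window="hamming",
                            nperseg=win_len, noverlap=overlap)
    log_sxx = 10.0 * np.log10(sxx + 1e-12)   # dB scale; epsilon avoids log(0)
    # Normalize to [0, 255] so the result can be processed as a gray image.
    norm = (log_sxx - log_sxx.min()) / (log_sxx.max() - log_sxx.min() + 1e-12)
    return (norm * 255.0).astype(np.uint8)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000)        # one second of stand-in audio
    img = gray_level_spectrogram(audio)
    print(img.shape, img.dtype)               # (257, 61) uint8
```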

Highlights

  • Speech emotion recognition (SER) is one of the most fundamental components of human-machine/computer interaction (HCI)

  • We find that the calculation of the spectral entropy parameter shows that spectral entropy depends only on the distribution of the spectral energy, not on its absolute amount

  • In real-life environments, the spectral entropy parameter is robust against changing signal levels, even though signal amplitude varies with the emotional state (see the sketch after this list)

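The two highlights above hinge on how spectral entropy is computed: the power spectrum is normalized into a probability distribution before the entropy is taken, so any overall amplitude scaling cancels out. The short Python sketch below illustrates this scale invariance; the frame length and FFT size are illustrative assumptions, not values from the paper.

```python
import numpy as np

def spectral_entropy(frame, n_fft=512):
    """Shannon entropy of the normalized power spectrum of one frame."""
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    p = power / (power.sum() + 1e-12)   # spectral energy as a distribution
    p = p[p > 0]                        # drop empty bins before the log
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
# Scaling the amplitude by 10 (i.e., a 20 dB louder signal) leaves the
# entropy unchanged, because the spectrum is normalized before the log.
print(spectral_entropy(frame), spectral_entropy(10.0 * frame))
```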

Summary

Introduction

Speech emotion recognition (SER) is one of the most fundamental components of human-machine/computer interaction (HCI). A compact and highly efficient speech representation carries much information about parameters such as energy, pitch (F0), formants, and timing. These parameters are the acoustic features of speech most often used in emotion recognition systems [21,22]. In order to increase accuracy in real-life emotion recognition, a novel feature extraction method based on multi-resolution texture image information (MRTII) is proposed in this paper. When the emotional speech spectrogram image is decomposed into four sub-band images with the TSWT, we are able to zoom into any desired frequency channel of each emotion for further decomposition, so the resulting sub-band images contain rich texture information; a sketch of this decomposition follows below.
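
To make the decomposition step above concrete, the following Python sketch performs a multi-level 2-D wavelet decomposition of a spectrogram image, assuming TSWT refers to a tree-structured wavelet transform: each stage splits an image into four sub-band images (LL, LH, HL, HH), and the most energetic sub-band is decomposed again ("zooming" into that channel). The wavelet choice ('db4'), the energy criterion, and the simple per-band statistics are illustrative assumptions, not the paper's exact texture features; here a random image stands in for the compensated gray-level spectrogram image.

```python
import numpy as np
import pywt  # PyWavelets

def subband_energy(band):
    """Mean squared coefficient value of one sub-band image."""
    return float(np.mean(np.asarray(band, dtype=np.float64) ** 2))

def tree_structured_decompose(image, wavelet="db4", levels=2):
    """At each level, split the current image into four sub-band images
    and follow the most energetic one downward."""
    node = np.asarray(image, dtype=np.float64)
    tree = []
    for _ in range(levels):
        ll, (lh, hl, hh) = pywt.dwt2(node, wavelet)
        bands = {"LL": ll, "LH": lh, "HL": hl, "HH": hh}
        tree.append(bands)
        # Zoom into the dominant frequency channel for further decomposition.
        node = max(bands.values(), key=subband_energy)
    return tree

def band_features(band, bins=32):
    """Energy and histogram entropy of one sub-band, standing in for
    the paper's richer texture statistics."""
    hist, _ = np.histogram(band, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return subband_energy(band), float(-np.sum(p * np.log2(p)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec_img = rng.random((257, 61))          # stand-in spectrogram image
    for level, bands in enumerate(tree_structured_decompose(spec_img), 1):
        feats = {name: band_features(b) for name, b in bands.items()}
        print(f"level {level}:", feats)
```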

  • The Proposed MRTII-Based Features
      • The Calculation of Gray-Level Spectrogram Image
      • The Compensation of Spectrogram Image
      • Multi-Resolution Texture Analysis
  • EMO-DB
  • KHUSC-EmoDB
  • Real-Life Database
  • MFCC Features
  • Prosodic Features
  • The LLD Features
  • Experiments and Results
      • The Emotional Database
      • Classification Comparison
      • Evaluation in Artificial Databases
      • Evaluation in Real-life Corpora
  • Conclusions