Abstract

Recently, researchers have paid increasing attention to recognizing the emotional state of an individual from his/her speech signals, as speech is the fastest and most natural mode of communication between individuals. In this work, a new feature enhancement method based on the Gaussian mixture model (GMM) was proposed to enhance the discriminatory power of the features extracted from speech and glottal signals. Three different emotional speech databases were utilized to evaluate the proposed methods. An extreme learning machine (ELM) and a k-nearest neighbor (kNN) classifier were employed to classify the different types of emotions. Several experiments were conducted, and the results show that the proposed methods significantly improved speech emotion recognition performance compared with research works published in the literature.

Highlights

  • Spoken utterances of an individual can provide information about his/her health state, emotion, language used, gender, and so on

  • The average emotion recognition rates for the original (raw) and enhanced relative wavelet packet energy and entropy features, and for the best enhanced features, are tabulated in Tables 3, 4, and 5

  • The extreme learning machine (ELM) kernel consistently outperforms the k-nearest neighbor (kNN) classifier in terms of average emotion recognition rate, irrespective of the order of the “db” wavelets
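The relative wavelet packet energy and entropy features referenced in the highlights can be illustrated with a minimal numpy-only sketch. This toy version uses a Haar analysis filter in place of the higher-order “db” wavelets the paper evaluates, and the decomposition level and frame length are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def haar_split(x):
    # One level of Haar analysis: approximation and detail sub-bands
    x = x[: len(x) // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def wavelet_packet(x, level):
    # Full wavelet packet tree: every sub-band is split again at each level,
    # yielding 2**level sub-bands at the deepest level
    bands = [x]
    for _ in range(level):
        bands = [sub for band in bands for sub in haar_split(band)]
    return bands

def relative_energy_entropy(x, level=3):
    bands = wavelet_packet(np.asarray(x, dtype=float), level)
    energies = np.array([np.sum(b ** 2) for b in bands])
    rel = energies / energies.sum()            # relative wavelet packet energy
    ent = -np.sum(rel * np.log2(rel + 1e-12))  # Shannon entropy of that distribution
    return rel, ent

# Example on a synthetic "speech-like" frame (sinusoid plus noise)
rng = np.random.default_rng(0)
frame = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 256)) + 0.1 * rng.standard_normal(256)
rel, ent = relative_energy_entropy(frame, level=3)
print(len(rel), float(rel.sum()))  # 8 sub-bands; relative energies sum to 1
```

The relative energies form a probability-like distribution over sub-bands, so the entropy is bounded by log2 of the number of sub-bands (here, 3 bits for 8 sub-bands); concentrated sub-band energy gives low entropy, diffuse energy gives high entropy.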


Introduction

Spoken utterances of an individual can provide information about his/her health state, emotion, language used, gender, and so on. Most existing emotional speech databases contain three types of emotional speech recordings: simulated, elicited, and natural. Natural recordings are closest to genuine emotion, but if the speakers know that they are being recorded, the expressed emotion tends to become artificial. High emotion recognition accuracies have been obtained for two-class emotion recognition (high arousal versus low arousal), but multiclass emotion recognition remains challenging. This is due to the following reasons: (a) identifying speech features that are information-rich and parsimonious, (b) variation in sentences, speakers, speaking styles, and rates, (c) more than one perceived emotion in the same utterance, and (d) long-term versus short-term emotional states [1, 3, 4].

