Abstract

The production of emotional speech involves deviations in speech production features relative to neutral (non-emotional) speech. The objective of this study is to capture the deviations in features related to the excitation component of speech and to develop a system for automatic recognition of emotions based on these deviations. The emotions considered in this study are anger, happiness, sadness and the neutral state. The study shows that the deviations of the excitation features carry useful information, which can be exploited to develop an emotion recognition system. The excitation features used in this study are the instantaneous fundamental frequency (F0), the strength of excitation, the energy of excitation and the ratio of the high-frequency to low-frequency band energy (β). A hierarchical binary decision tree approach is used to develop an emotion recognition system with neutral speech as the reference. The recognition experiments showed that the excitation features perform comparably to or better than the existing prosody features and spectral features, such as mel-frequency cepstral coefficients, perceptual linear predictive coefficients and modulation spectral features.
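Of the excitation features listed above, the band-energy ratio β lends itself to a compact illustration. The following Python sketch computes the ratio of high-frequency to low-frequency band energy for one short speech frame; the 1 kHz split frequency, the frame length and the use of a plain FFT power spectrum are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the band-energy ratio feature beta = E_high / E_low for one
# windowed speech frame. Split frequency, frame length and windowing are assumptions.
import numpy as np

def band_energy_ratio(frame, fs, split_hz=1000.0):
    """Return beta = high-band energy / low-band energy of one speech frame."""
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2          # power spectrum
    freqs = np.fft.rfftfreq(len(windowed), d=1.0 / fs)     # bin frequencies in Hz
    low = spectrum[freqs < split_hz].sum()
    high = spectrum[freqs >= split_hz].sum()
    return high / (low + 1e-12)                            # guard against division by zero

# Example with a synthetic 30 ms frame at 16 kHz (a low tone plus a weak high tone)
fs = 16000
t = np.arange(int(0.03 * fs)) / fs
frame = np.sin(2 * np.pi * 220 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
print(band_energy_ratio(frame, fs))
```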

Highlights

  • The goal of speech technology is to make human–machine interaction as natural as possible

  • Features such as F0 show an increasing trend, and the strength of excitation (SoE) shows a decreasing trend, for anger and happiness when compared to neutral speech [16,41]

  • An emotion recognition system based on features related to the excitation component of speech production was developed by considering emotional states as deviations from the neutral state (a decision-tree sketch follows below)
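The abstract and the highlight above describe a hierarchical binary decision tree with neutral speech as the reference. The excerpt does not specify the tree structure or the individual binary classifiers, so the split order, thresholds and feature layout in the following Python sketch are purely illustrative assumptions.

```python
# Illustrative hierarchical binary decision scheme with neutral speech as the
# reference. The split order, thresholds and feature meanings below are
# hypothetical; the paper's actual tree and classifiers may differ.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Node:
    """One binary decision; each child is another Node or a final emotion label."""
    classify: Callable[[Sequence[float]], bool]
    if_true: object    # Node or str (emotion label)
    if_false: object   # Node or str (emotion label)

def decide(node, features):
    """Walk the binary tree until a leaf (an emotion label string) is reached."""
    while isinstance(node, Node):
        node = node.if_true if node.classify(features) else node.if_false
    return node

# Hypothetical feature vector: f[0] ~ F0 deviation, f[1] ~ SoE deviation,
# f[2] ~ beta deviation, all measured relative to neutral speech.
stage3 = Node(lambda f: f[2] > 0.5, "anger", "happiness")    # larger beta deviation -> anger
stage2 = Node(lambda f: f[1] < -0.3, "sadness", stage3)      # reduced excitation strength -> sadness
tree = Node(lambda f: abs(f[0]) < 0.1, "neutral", stage2)    # small F0 deviation -> neutral

print(decide(tree, [0.6, 0.2, 0.7]))   # -> "anger" under these toy thresholds
```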



Introduction

The goal of speech technology is to make human–machine interaction as natural as possible. The third type of emotional speech database is the (near to) natural database, in which recordings do not involve any prompting or obvious eliciting of emotional responses. Sources for such natural situations are mostly talk shows in TV broadcasts, interviews, group interactions, etc. A few studies have analyzed emotional speech using voice source features [1,29,41,53,54,56,57]. Most of these studies [1,52,53,54,57] have focused mainly on specific utterances such as vowels. For extraction of these voice source features, glottal flow estimates have been computed in these studies by using iterative adaptive inverse filtering (IAIF) [2].
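The excerpt notes that earlier voice source studies obtained glottal flow estimates with iterative adaptive inverse filtering (IAIF) [2]. As a rough illustration of the underlying idea, the following Python sketch performs a single pass of linear-prediction inverse filtering to expose the excitation residual; the actual IAIF algorithm is iterative and additionally models the glottal contribution and lip radiation, so this is not a reimplementation of [2]. The LP order, frame length and sampling rate are assumptions.

```python
# Simplified single-pass LP inverse filtering: estimate a vocal-tract model for a
# frame and filter the frame with its inverse to expose the excitation residual.
import numpy as np
from scipy.signal import lfilter

def lp_coefficients(frame, order):
    """Autocorrelation-method LP inverse filter A(z) = 1 - sum_k a_k z^{-k}."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]      # autocorrelation, lags 0..N-1
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                            # normal equations
    return np.concatenate(([1.0], -a))

def residual(frame, order=16):
    """Inverse-filter a frame with its own LP model (window used for estimation only)."""
    a = lp_coefficients(frame * np.hamming(len(frame)), order)
    return lfilter(a, [1.0], frame)

fs = 16000
frame = np.random.randn(int(0.03 * fs))    # stand-in for a 30 ms voiced speech frame
e = residual(frame)
print(e.shape)
```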
