Abstract

Many speech emotion recognition systems have been designed using different features and classification methods. Still, there is a lack of knowledge and reasoning about the underlying speech characteristics and processing, i.e., how basic characteristics, methods, and settings affect the accuracy, and to what extent. This study extends the physical perspective on speech emotion recognition by analyzing basic speech characteristics and modeling methods: time characteristics (segmentation, window types, and the lengths and overlaps of classification regions), frequency ranges, frequency scales, processing of the whole speech signal (spectrograms), the vocal tract (filter banks, linear prediction coefficient (LPC) modeling), and excitation (inverse LPC filtering) signals, magnitude and phase manipulations, cepstral features, etc. In the evaluation phase, a state-of-the-art classification method and rigorous statistical tests were applied, namely N-fold cross-validation, paired t-tests, and rank and Pearson correlations. The results revealed several settings in the 75% accuracy range (seven emotions). The most successful methods were based on vocal tract features using psychoacoustic filter banks covering the 0–8 kHz frequency range. Spectrograms, which carry both vocal tract and excitation information, also scored well. It was found that even basic processing such as pre-emphasis, segmentation, and magnitude modifications can dramatically affect the results. Most findings are robust, exhibiting strong correlations across the tested databases.
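
To make two of the basic processing steps named above concrete, the following is a minimal sketch of pre-emphasis followed by segmentation into overlapping, windowed frames. It is an illustration only, not the study's implementation: the pre-emphasis coefficient (0.97), the 16 kHz sampling rate, the 25 ms frame length, the 10 ms hop, the Hamming window, and all function names are assumptions chosen for the example.

```python
import math


def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is a commonly used value, assumed here for illustration.
    """
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def segment(signal, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window.

    A trailing partial frame is dropped rather than zero-padded.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


# Assumed settings: 16 kHz audio, 25 ms frames, 10 ms hop (60% overlap).
sample_rate = 16000
frame_len = sample_rate * 25 // 1000
hop = sample_rate * 10 // 1000

# One second of a 440 Hz tone stands in for a speech recording.
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = segment(pre_emphasis(tone), frame_len, hop)
print(len(frames), len(frames[0]))
```

Changing `alpha`, `frame_len`, `hop`, or the window function here corresponds to the kind of low-level setting whose effect on accuracy the study evaluates.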

Highlights

  • Speech provides a natural and very complex form of communication: it can be rather precise by means of grammar, and it can convey side information about our state and our attitude to what is said or to whom it is told

  • In general, the same ranking of methods was observed on both databases: vocal tract based methods, represented by the Gammatone filter bank (GFB), scored best, followed by spectrograms carrying vocal tract and excitation signals, while methods utilizing phases were the least successful

  • A wide set of experiments targeting basic speech signal characteristics and speech processing methods applied to speech emotion recognition (SER) was designed and evaluated


Summary

Introduction

Speech provides a natural and very complex form of communication: it can be rather precise by means of grammar, and it can convey side information about our state (physical, mental) and our attitude to what is said or to whom it is told (emotions). Such information is added to speech rather unintentionally, and listeners can detect even the smallest swings in a speaker's mood. Speech emotion recognition (SER) systems have several applications, and their usage is still widening. They are useful in the automotive industry, where they can increase driver safety; e.g., in [5] an in-car conversation was analyzed, and in [6] visual emotion recognition was added, making the system more accurate.
