Abstract

Recently, Speech Emotion Recognition (SER) has become an important research topic in affective computing. It is a difficult problem, and some of its greatest challenges lie in feature selection and representation. A good feature representation should reflect both the global trends and the temporal structure of the signal, since emotions naturally evolve in time; this has become possible with the advent of Recurrent Neural Networks (RNNs), which are actively used today for various sequence modelling tasks. This paper proposes a hybrid approach to feature representation, which combines traditionally engineered statistical features with a Long Short-Term Memory (LSTM) sequence representation in order to take advantage of both the short-term and the long-term acoustic characteristics of the signal, thereby capturing not only its general trends but also its temporal structure. The proposed method is evaluated on three publicly available acted emotional speech corpora in three different languages, namely RUSLANA (Russian speech), BUEMODB (Turkish speech) and EMODB (German speech). Compared to the traditional approach, the results of our experiments show an absolute improvement of 2.3% and 2.8% on two of the three databases, and comparable performance on the third. Therefore, provided enough training data, the proposed method proves effective in modelling the emotional content of speech utterances.
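
To make the hybrid representation concrete, the sketch below shows one plausible realization in Keras: a frame-level branch modelled by an LSTM and an utterance-level branch of statistical functionals, fused before classification. This is a minimal sketch, not the paper's exact configuration; the sequence length, number of low-level descriptors, layer sizes and number of emotion classes are illustrative assumptions (the 1582-dimensional input of the second branch corresponds to the size of the INTERSPEECH 2010 functional set mentioned in the paper).

```python
from tensorflow.keras import layers, Model

N_FRAMES = 300        # frames per utterance after padding/truncation (assumed)
N_LLD = 32            # frame-level low-level descriptors per frame (assumed)
N_FUNCTIONALS = 1582  # utterance-level functionals (INTERSPEECH 2010 set size)
N_EMOTIONS = 7        # number of emotion classes (corpus-dependent)

# Branch 1: the frame-level LLD sequence is summarized by an LSTM.
seq_in = layers.Input(shape=(N_FRAMES, N_LLD), name="frame_lld")
seq_repr = layers.LSTM(128, name="lstm_sequence")(seq_in)

# Branch 2: the utterance-level statistical functionals pass through a dense layer.
stat_in = layers.Input(shape=(N_FUNCTIONALS,), name="utterance_functionals")
stat_repr = layers.Dense(128, activation="relu", name="functional_embedding")(stat_in)

# Fuse both representations and classify the emotion.
fused = layers.Concatenate(name="hybrid_representation")([seq_repr, stat_repr])
out = layers.Dense(N_EMOTIONS, activation="softmax", name="emotion")(fused)

model = Model(inputs=[seq_in, stat_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

Under this sketch, the baselines described in the Introduction would correspond to using only one of the two branches.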

Highlights

  • Automatic emotion recognition has emerged as one of the most important and challenging research topics of affective computing [1, 2], a modern study concerned with recognizing and processing human feelings

  • After experimenting with the number of components used in Principal Component Analysis (PCA), we can conclude that the optimal number of principal components differs across datasets, which may be explained by different data distributions due to varying recording conditions and audio signal quality (one way to choose this number is sketched after this list)

  • We have proposed a new method for combining two feature representations for emotion classification from speech: a frame-level representation of low-level descriptors (LLDs) and an utterance-level representation of LLD functionals
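
The component-selection sketch referenced above: one common way to pick the number of principal components per dataset is to take the smallest number whose cumulative explained variance reaches a chosen threshold. The 95% threshold and the random stand-in data below are illustrative assumptions, not values reported in the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def n_components_for_variance(features: np.ndarray, threshold: float = 0.95) -> int:
    """Smallest number of principal components whose cumulative explained
    variance reaches `threshold` for this particular dataset."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA().fit(scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)

# Stand-in data for illustration; real input would be the utterance-level
# feature matrix extracted from RUSLANA, BUEMODB or EMODB.
X = np.random.randn(500, 1582)
print(n_components_for_variance(X))
```

Because the corpora differ in recording conditions and signal quality, the threshold (or a cross-validated grid over component counts) would typically be tuned per dataset.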

Summary

Introduction

Automatic emotion recognition has emerged as one of the most important and challenging research topics of affective computing [1, 2], a modern study concerned with recognizing and processing human feelings. The baseline method against which we compare our results consists of a single-branch feature representation using the predefined INTERSPEECH 2010 feature set (utterance-level functionals extracted with the openSMILE toolkit) and a single LSTM neural network. The reason for combining the two different representations is that short-time characteristics (frame-level features), together with appropriate modelling techniques, allow capturing the temporal structure of the signal, while long-time characteristics (utterance-level features) are capable of expressing its general trends [24].
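
The distinction between the two representations can be illustrated with a short sketch. Note that this is not the paper's pipeline: the paper extracts the utterance-level functionals with the openSMILE INTERSPEECH 2010 configuration, whereas here librosa MFCCs stand in for the low-level descriptors and only a few simple functionals are computed; the file name and sampling rate are also assumptions.

```python
import numpy as np
import librosa

def frame_level_lld(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Frame-level low-level descriptors: one feature vector per short-time frame."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    return mfcc.T                                           # shape: (n_frames, n_mfcc)

def utterance_level_functionals(lld: np.ndarray) -> np.ndarray:
    """Utterance-level functionals: statistics of the LLD contours over the whole utterance."""
    return np.concatenate([lld.mean(axis=0), lld.std(axis=0),
                           lld.min(axis=0), lld.max(axis=0)])

lld = frame_level_lld("utterance.wav")          # sequence input for the LSTM branch
functionals = utterance_level_functionals(lld)  # fixed-length input for the statistical branch
```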
