Abstract

Many researchers are drawn to Speech Emotion Recognition (SER) because it is considered a key effort in Human-Computer Interaction (HCI). The main focus of this work is to design a model for recognizing emotion from speech, a task that poses many challenges. Because emotion in speech is temporal and sparse in nature, we have adopted a multivariate time series representation of the input features. The work has also adopted the Echo State Network (ESN), a reservoir-computing instance of the Recurrent Neural Network (RNN), to avoid model complexity: its reservoir is untrained and sparse while mapping the features into a higher-dimensional space. Additionally, we applied dimensionality reduction using Sparse Random Projection (SRP), which offers significant computational advantages. Late fusion of bidirectional inputs has been applied to capture additional information from each direction of the input data independently. Speaker-independent and/or speaker-dependent experiments were performed on four common speech emotion datasets: Emo-DB, SAVEE, RAVDESS, and the FAU Aibo Emotion Corpus. The results show that the designed model outperforms the state of the art at a lower computational cost.
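The paper does not include code; the following sketch only illustrates the two mechanisms the abstract names, an untrained sparse ESN reservoir mapping a frame-level feature sequence into a high-dimensional state space, followed by a Sparse Random Projection back down. All sizes, the ~10% connectivity, the spectral radius, and the use of MFCC-like 13-dimensional frames are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 13 MFCC-like features per frame, 100 frames,
# a 500-unit reservoir (the paper does not specify these).
n_inputs, n_reservoir, n_steps = 13, 500, 100

# Untrained, sparse reservoir weights: the ESN idea is to fix W_in and W
# at random and train only a linear readout on the reservoir states.
W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_inputs))
W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
W[rng.random((n_reservoir, n_reservoir)) > 0.1] = 0.0   # ~10% connectivity
W *= 0.9 / max(abs(np.linalg.eigvals(W)))                # spectral radius < 1

def reservoir_states(U):
    """Map a (n_steps, n_inputs) feature sequence into reservoir space."""
    x = np.zeros(n_reservoir)
    states = []
    for u in U:
        x = np.tanh(W_in @ u + W @ x)   # leak-free ESN state update
        states.append(x)
    return np.array(states)

U = rng.standard_normal((n_steps, n_inputs))   # dummy feature sequence
H = reservoir_states(U)                        # (100, 500): very high-dimensional

# Sparse Random Projection: reduce 500 -> 50 dimensions with a sparse
# Achlioptas-style +/- matrix (density s and target size k are illustrative).
k, s = 50, 3
R = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
               size=(n_reservoir, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]) / np.sqrt(k)
H_low = H @ R                                  # (100, 50)
```

The readout classifier would then be trained on `H_low` rather than `H`, which is where the computational saving claimed in the abstract comes from.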

Highlights

  • Emotion can play an important role in many parts of a human’s life such as communicating, understanding, helping each other, rational thinking, creativity and sometimes it has a vital part in decision making

  • In the methodology, the model design is presented and the proposed model is briefly explained: it describes the main components of the solution and how the proposed method improves the performance of the Echo State Network (ESN) in recognizing emotions from speech

  • A few works have used ESN for Speech Emotion Recognition (SER) systems [14] [37] [38], but none of them reached outstanding performance. This unconvincing performance may be due to three factors: 1) adopting unidirectional signal processing, which loses important information between the speech frames in the opposite direction; 2) the very high-dimensional representation ESN produces for temporal data, which negatively influences the performance of the classifier; and 3) manually tuning the ESN hyperparameters instead of optimizing them, which may not lead to an optimal ESN model


Summary

Introduction

Emotion can play an important role in many parts of a human’s life, such as communicating, understanding, helping each other, rational thinking, and creativity, and it sometimes has a vital part in decision making. Detecting emotion is a challenging task, and it has become an active research topic covering a wide area due to the high demand for it in many practical applications such as healthcare, social robots, and Human-Computer Interaction (HCI) [1] [2]. Speech is a quick and important way for individuals to communicate with each other [4], and the speech signal is considered a fast and useful mechanism for HCI. Emotions have always been a part of normal human conversation, making speech more attractive and more effective. Detecting emotions from speech signals is an old yet big challenge in the field of artificial intelligence [5], which has inspired many researchers to work on it

