Abstract

The automatic detection of an emotional state from human speech, which plays a crucial role in human–machine interaction, has consistently proven a difficult task for machine learning algorithms. Previous work on emotion recognition has mostly focused on the extraction of carefully hand-crafted, highly engineered features, and results from these works have demonstrated the importance of discriminative spatio-temporal features for modeling the continuous evolution of different emotions. Recently, spectrogram representations of emotional speech have achieved competitive performance for automatic speech emotion recognition (SER). How machine learning algorithms can learn effective compositional spatio-temporal dynamics for SER from such learned features, herein denoted as deep spectrum representations, remains a fundamental problem. In this paper, we develop a model that alleviates this limitation by leveraging a parallel combination of attention-based bidirectional long short-term memory recurrent neural networks (BLSTM-RNNs) and attention-based fully convolutional networks (FCNs). Extensive experiments were undertaken on the interactive emotional dyadic motion capture (IEMOCAP) database and the FAU Aibo Emotion Corpus (FAU-AEC) to highlight the effectiveness of our approach. The experimental results indicate that deep spectrum representations extracted from the proposed model are well suited to the task of SER, achieving a weighted accuracy (WA) of 68.1% and an unweighted accuracy (UA) of 67.0% on IEMOCAP, and a UA of 45.4% on FAU-AEC. Key results indicate that the extracted deep representations, combined with a linear support vector classifier, are comparable in performance to eGeMAPS and COMPARE, two standard acoustic feature representations.
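
For illustration, the parallel structure described in the abstract can be sketched in PyTorch as follows; the layer sizes, the form of the attention pooling, and the four-class output are assumptions made for this sketch, not the exact configuration reported in the paper.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Soft attention over the time axis: score each frame, return the weighted sum."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                           # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)     # (batch, time, 1)
        return (w * x).sum(dim=1)                   # (batch, dim)


class AttentionBLSTMFCN(nn.Module):
    """Parallel attention-BLSTM / attention-FCN branches over log-mel spectrograms (sizes assumed)."""

    def __init__(self, n_mels=64, num_classes=4):
        super().__init__()
        # Temporal branch: bidirectional LSTM over the frame sequence.
        self.blstm = nn.LSTM(n_mels, 128, batch_first=True, bidirectional=True)
        self.blstm_att = AttentionPool(2 * 128)
        # Spatial branch: small fully convolutional stack over the spectrogram image.
        self.fcn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.fcn_att = AttentionPool(64)
        self.classifier = nn.Linear(2 * 128 + 64, num_classes)

    def forward(self, spec):                        # spec: (batch, time, n_mels)
        h, _ = self.blstm(spec)                     # (batch, time, 256)
        temporal = self.blstm_att(h)                # (batch, 256)
        maps = self.fcn(spec.unsqueeze(1))          # (batch, 64, time/4, n_mels/4)
        frames = maps.mean(dim=3).transpose(1, 2)   # (batch, time/4, 64): pool away the mel axis
        spatial = self.fcn_att(frames)              # (batch, 64)
        deep_spectrum = torch.cat([temporal, spatial], dim=1)
        return self.classifier(deep_spectrum)
```

In this sketch, the concatenated vector `deep_spectrum` plays the role of the deep spectrum representation that the abstract pairs with a linear support vector classifier.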

Highlights

  • Automatic emotion recognition from speech signals, which aims to identify basic emotional states using machine learning, remains a difficult task

  • The main contributions of this article are as follows: i) we propose a novel framework that fuses spatial and temporal representations for speech emotion recognition (SER) by leveraging attention-based fully convolutional networks (FCNs) together with attention-based BLSTM-RNNs, an approach capable of automatically learning feature representations and modeling the temporal dependencies; ii) following the recent success of applying deep learning methods directly to spectrograms, enhanced deep spectrum representations are derived by forwarding spectrograms through the Attention-BLSTM-FCN model (a minimal extraction-and-classification sketch is given after this list); and iii) the proposed method can be adapted to enhance existing state-of-the-art methods

  • The interactive emotional dyadic motion capture (IEMOCAP) database consists of audio-visual data with transcriptions, recorded from dialogues between two professional actors across five sessions, with the corpus divided into two parts, improvised and scripted [43]
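
As noted in contribution ii) above, deep spectrum representations are obtained by forwarding spectrograms through the trained Attention-BLSTM-FCN model. A minimal sketch of evaluating such representations with a linear support vector classifier in scikit-learn is shown below; the feature file names and hyper-parameters are hypothetical placeholders, not the paper's setup.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical files holding deep spectrum representations (one row per utterance),
# e.g. taken from the penultimate layer of the trained Attention-BLSTM-FCN model.
X_train, y_train = np.load("deep_train.npy"), np.load("labels_train.npy")
X_test, y_test = np.load("deep_test.npy"), np.load("labels_test.npy")

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# WA = overall (weighted) accuracy; UA = unweighted accuracy, i.e. mean per-class recall.
print("WA:", accuracy_score(y_test, pred))
print("UA:", balanced_accuracy_score(y_test, pred))
```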

Summary

Introduction

Automatic emotion recognition from speech signals, aiming at the identification of basic emotional states using machine learning, remains a difficult task. Many previous research efforts have investigated hand-crafted acoustic features for the task of speech emotion recognition (SER), such as prosodic features and features based on the Teager energy operator (TEO). With the increased use of neural networks for SER tasks, mel-scale filterbank spectrograms are widely used as an input feature. Deep spectrum representations, which are features automatically extracted from speech spectrogram images using deep learning models, have produced promising results in the field of SER [1] and in other speech- and audio-related applications [1]–[3].
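
As a concrete example of the mel-scale filterbank input mentioned above, a log-mel spectrogram can be computed per utterance with librosa; the sampling rate, window, hop, and mel-band values below are illustrative assumptions rather than the settings used in the paper.

```python
import librosa
import numpy as np


def logmel_spectrogram(path, sr=16000, n_mels=64, win_len=0.025, hop_len=0.010):
    """Log mel-scale filterbank spectrogram for one utterance (illustrative defaults)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(win_len * sr),        # 25 ms analysis window
        hop_length=int(hop_len * sr),   # 10 ms frame shift
        n_mels=n_mels,
    )
    return librosa.power_to_db(mel, ref=np.max).T   # shape: (frames, n_mels)
```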
