Abstract

Emotion information from speech can effectively help robots understand speakers' intentions in natural human-robot interaction. The human auditory system can easily track the temporal dynamics of emotion by perceiving the intensity and fundamental frequency of speech, and it focuses on salient emotion regions. Therefore, combining the auditory mechanism with an attention mechanism may be an effective approach to speech emotion recognition. Some previous studies used auditory-based static features to identify emotion while ignoring emotion dynamics; others used attention models to capture salient emotion regions while ignoring cognitive continuity. To fully exploit the auditory and attention mechanisms, we first investigate temporal modulation cues from auditory front-ends and then propose a joint deep learning model that combines 3D convolutions with attention-based sliding recurrent neural networks (ASRNNs) for emotion recognition. Our experiments on the IEMOCAP and MSP-IMPROV datasets indicate that the proposed method can effectively recognize speech emotion from temporal modulation cues. A subjective evaluation shows that the attention patterns of the attention model are largely consistent with human behavior in recognizing emotions.
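To make the described architecture concrete, the following is a minimal sketch, assuming a PyTorch implementation, of a joint model that applies 3D convolutions over a temporal-modulation representation and pools a recurrent encoding with additive attention. The input shape, channel counts, hidden size, and number of emotion classes are illustrative assumptions, and the simple attention pooling here stands in for, rather than reproduces, the paper's attention-based sliding recurrent network (ASRNN).

```python
# Sketch only: hyperparameters and tensor shapes are assumptions, not the authors' settings.
import torch
import torch.nn as nn


class Conv3dAttnRNN(nn.Module):
    def __init__(self, n_classes: int = 4, hidden: int = 128):
        super().__init__()
        # 3D convolutions over (modulation frequency, acoustic frequency, time)
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 1)),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 1)),
        )
        # Bidirectional GRU over the time axis of the pooled feature maps
        self.rnn = nn.GRU(input_size=32, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        # Additive attention: one score per time step
        self.attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, 1, mod_freq, ac_freq, time)
        h = self.conv(x)                          # (B, C, M, F, T)
        h = h.mean(dim=(2, 3))                    # pool spectral axes -> (B, C, T)
        h = h.transpose(1, 2)                     # (B, T, C)
        seq, _ = self.rnn(h)                      # (B, T, 2*hidden)
        w = torch.softmax(self.attn(seq), dim=1)  # attention weights over time
        ctx = (w * seq).sum(dim=1)                # attention-weighted context vector
        return self.out(ctx)                      # emotion class logits


if __name__ == "__main__":
    model = Conv3dAttnRNN()
    dummy = torch.randn(2, 1, 8, 64, 100)  # fabricated shapes for illustration
    print(model(dummy).shape)               # torch.Size([2, 4])
```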

Highlights

  • Speech is the most natural way for humans and robots to communicate

  • The IEMOCAP dataset consists of five sessions, where each session contains scripted and improvised utterances from two speakers

  • Considering that the human auditory system perceives the intensity and fundamental frequency of speech, tracks the temporal dynamics of emotion from this perceived information, and focuses on salient emotion regions, we propose a speech emotion recognition (SER) system that combines the auditory mechanism and attention mechanism of the human auditory system



Introduction

Speech is the most natural way for humans and robots to communicate. The key to effective communication is enabling robots or virtual agents to understand speakers' true intentions. Linguistic information alone is by no means sufficient for this. Vocal emotion information, a kind of nonlinguistic information, can significantly help robots or virtual agents understand speakers' true intentions. Speech emotion recognition (SER) is thus a research hotspot in natural human-robot interaction (HRI). Effective SER is still a very challenging problem, partly due to
