Abstract
Amongst the various characteristics of a speech signal, the expression of emotion is one that exhibits the slowest temporal dynamics. Hence, a performant speech emotion recognition (SER) system requires a predictive model capable of learning sufficiently long temporal dependencies in the analysed speech signal. In this work, we therefore propose a novel end-to-end neural network architecture based on the concept of dilated causal convolution with context stacking. Firstly, the proposed model consists only of parallelisable layers, thereby avoiding the inherent sequential bottleneck of recurrent neural network (RNN) layers. Secondly, the design of a dedicated dilated causal convolution block allows the model to have a receptive field as large as the input sequence length, while maintaining a reasonably low computational cost. Thirdly, by introducing a context stacking structure, the proposed model can exploit long-term temporal dependencies, hence providing an alternative to the use of RNN layers. We evaluate the proposed model on SER regression and classification tasks and compare it against a state-of-the-art end-to-end SER model. Experimental results indicate that the proposed model requires only one third of the parameters of the state-of-the-art model, while also significantly improving SER performance. Further experiments examine the impact of different input representations (i.e. raw audio samples vs. log mel-spectrograms) and illustrate the benefits of an end-to-end approach over the use of hand-crafted audio features. Moreover, we show that the proposed model can efficiently learn intermediate embeddings that preserve speech emotion information.
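To illustrate the core mechanism named in the abstract, the following is a minimal sketch of a dilated causal convolution stack in PyTorch. It assumes an exponentially doubling dilation schedule; the channel count, kernel width, and number of layers are illustrative assumptions rather than the configuration used in the paper, and the sketch omits the context stacking structure.

```python
# Minimal sketch (not the authors' implementation) of a dilated causal
# convolution stack. Dilation doubles at each layer, so the receptive
# field grows exponentially with depth while the parameter count grows
# only linearly. All hyperparameter values here are illustrative.
import torch
import torch.nn as nn


class DilatedCausalBlock(nn.Module):
    """Stack of 1-D convolutions whose dilation doubles per layer.

    With kernel size k and L layers, the receptive field is
    1 + (k - 1) * (2**L - 1) samples.
    """

    def __init__(self, channels: int = 64, kernel_size: int = 2, num_layers: int = 10):
        super().__init__()
        self.kernel_size = kernel_size
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=2 ** layer)
            for layer in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        for conv in self.convs:
            # Left-pad so each output sample depends only on current and
            # past inputs (causality) and the sequence length is preserved.
            pad = (self.kernel_size - 1) * conv.dilation[0]
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x


if __name__ == "__main__":
    block = DilatedCausalBlock()
    # 10 layers with kernel size 2 give a receptive field of
    # 1 + 1 * (2**10 - 1) = 1024 samples.
    out = block(torch.randn(1, 64, 16000))
    print(out.shape)  # torch.Size([1, 64, 16000])
```

With kernel size 2 and dilations doubling over L layers, the receptive field is 2^L samples, so the depth required grows only logarithmically with the desired context length. This is the property that allows a receptive field as large as the input sequence at a reasonably low computational cost.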