Abstract

Speech emotion recognition (SER) is a challenging task because of affective variability across speakers. SER performance depends heavily on the features extracted from the speech signal, and establishing an effective feature-extraction and classification model remains difficult. In this paper, we propose a new SER method based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model, denoted DCNN-BLSTMwA. We first preprocess the speech samples by data augmentation and dataset balancing. Second, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta) as DCNN input. A DCNN pre-trained on the ImageNet dataset is then applied to generate segment-level features, which we stack across a sentence into utterance-level features. Next, we adopt a BLSTM to learn high-level emotional features for temporal summarization, followed by an attention layer that focuses on emotionally relevant features. Finally, the learned high-level emotional features are fed into a Deep Neural Network (DNN) to predict the final emotion. Experiments on the EMO-DB and IEMOCAP databases achieve unweighted average recall (UAR) of 87.86% and 68.50%, respectively, outperforming most popular SER methods and demonstrating the effectiveness of our proposed method.
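The three-channel input described above (static log Mel-spectrogram plus its first- and second-order derivatives, stacked like an RGB image for the pre-trained DCNN) can be sketched as follows. This is a minimal NumPy illustration, assuming a standard regression-based delta filter over ±2 frames; the matrix sizes and the `delta` window are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def delta(feat, N=2):
    # Regression-based delta over +/-N neighboring frames (a common
    # convention in speech processing); edge frames are edge-padded.
    T = feat.shape[1]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    padded = np.pad(feat, ((0, 0), (N, N)), mode="edge")
    d = np.zeros_like(feat, dtype=float)
    for t in range(T):
        d[:, t] = sum(
            n * (padded[:, t + N + n] - padded[:, t + N - n])
            for n in range(1, N + 1)
        ) / denom
    return d

def three_channel_input(log_mel):
    # Stack static, delta, and delta-delta into one 3-channel "image".
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([log_mel, d1, d2], axis=-1)

# Hypothetical log-Mel matrix for one segment: 64 Mel bands x 100 frames.
log_mel = np.random.default_rng(0).standard_normal((64, 100))
x = three_channel_input(log_mel)
print(x.shape)  # (64, 100, 3)
```

Stacking the derivatives as extra channels lets an image-pretrained DCNN consume the spectrogram exactly as it would an RGB picture.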

Highlights

  • As the most natural and convenient medium of human communication, speech signals carry linguistic information, such as semantics and language type, as well as rich nonlinguistic information, such as speech emotion

  • Inspired by Zhang et al. (2017) and Zhao et al. (2018), in this paper we propose a novel method based on a Deep Convolution Neural Network (DCNN) and a Bidirectional Long Short-Term Memory with Attention (BLSTMwA) model, denoted DCNN-BLSTMwA

  • We demonstrate that three channels of log Mel-spectrograms (3-D log-Mels) as DCNN input are well suited to affective feature extraction, achieving better performance than Low-Level Descriptors (LLDs)


Summary

INTRODUCTION

As the most natural and convenient medium of human communication, speech signals carry linguistic information, such as semantics and language type, as well as rich nonlinguistic information, such as speech emotion. Zhang et al. (2017) proposed a method that directly uses three channels of log Mel-spectrograms as input to a pre-trained DCNN; they used a discriminant temporal pyramid matching (DTPM) algorithm to normalize segment-level features of unequal length. Zheng et al. (2018) proposed an SER model combining a convolutional neural network (CNN) with a random forest (RF): the CNN extracts emotional features from spectrograms, and the RF performs classification. An attention mechanism assigns relatively high weights to emotion-related features, emphasizing their importance while reducing the influence of irrelevant ones. It helps the network automatically focus on emotionally relevant segments and obtain discriminative utterance-level features for SER.
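The attention-weighted pooling just described can be sketched in a few lines: score each BLSTM hidden state, normalize the scores with a softmax, and take the weighted sum as the utterance-level representation. This is a minimal NumPy sketch; the single learnable scoring vector `w` is one common parameterization, assumed here for illustration rather than taken from the paper.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    # H: (T, d) BLSTM hidden states for one utterance.
    # w: (d,) learnable scoring vector (hypothetical parameterization).
    scores = H @ w           # one relevance score per time step, shape (T,)
    alpha = softmax(scores)  # attention weights, sum to 1
    return alpha @ H, alpha  # utterance vector (d,), weights (T,)

rng = np.random.default_rng(0)
H = rng.standard_normal((20, 8))  # 20 time steps, 8-dim hidden states
w = rng.standard_normal(8)
u, alpha = attention_pool(H, w)
print(u.shape, alpha.shape)  # (8,) (20,)
```

Because the weights sum to 1, emotionally salient frames dominate the pooled vector while uninformative frames are suppressed, which is exactly the behavior the attention layer is meant to provide.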

PROPOSED METHODOLOGY
Preprocessing
Log Mel-Spectrograms
Pre-training and Finetuning
Architecture of DCNN-BLSTMwA
Attention Layer
DNN Classification
Datasets
Experiment Setup
Experiment Results
Method
CONCLUSIONS AND FUTURE WORK
Findings
DATA AVAILABILITY STATEMENT