DeepCNN: Spectro‐temporal feature representation for speech emotion recognition

Nasir Saleem, Yudong Zhang, Rizwana Irfan, Seifedine Kadry, Hafiz Tayyab Rauf, Ahmad Almadhor, Jiechao Gao

doi:10.1049/cit2.12233

Abstract

Speech emotion recognition (SER) is an important research problem in human‐computer interaction systems. The representation and extraction of features are significant challenges in SER systems. Despite their promising results, recent studies generally do not leverage progressive fusion techniques for effective feature representation and for increasing receptive fields. To mitigate this problem, this article proposes DeepCNN, which fuses spectral and temporal features of emotional speech by running convolutional neural networks (CNNs) in parallel with a convolution layer‐based transformer. Two parallel CNNs extract the spectral (2D‐CNN) and temporal (1D‐CNN) feature representations. A 2D‐convolution layer‐based transformer module extracts spectro‐temporal features and concatenates them with the features from the parallel CNNs. The learnt low‐level concatenated features are then passed to a deep framework of convolutional blocks, which retrieves a high‐level feature representation; the emotional states are subsequently categorised using an attention gated recurrent unit and a classification layer. This fusion technique results in a deeper hierarchical feature representation at a lower computational cost while simultaneously expanding the filter depth and reducing the feature map size. The Berlin Database of Emotional Speech (EMO‐BD) and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets are used in experiments to recognise distinct speech emotions. With efficient spectral and temporal feature representation, the proposed SER model achieves 94.2% accuracy on the EMO‐BD dataset and 81.1% accuracy on the IEMOCAP dataset. DeepCNN thus outperforms the baseline SER systems in emotion recognition accuracy on both datasets.
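The abstract's claim that stacked convolutional blocks expand the filter depth while shrinking the feature map and widening the receptive field can be illustrated with simple shape arithmetic. The sketch below is not the authors' configuration: the 128 x 128 input, the 3 x 3 kernels, the stride of 2, the padding of 1, and the channel counts are all assumptions chosen for illustration, and `ConvBlock` and `trace` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass
class ConvBlock:
    # Hypothetical description of one convolutional block.
    out_channels: int
    kernel: int
    stride: int
    padding: int


def conv_out(size, kernel, stride, padding):
    # Standard output-length formula for a convolution along one axis.
    return (size + 2 * padding - kernel) // stride + 1


def trace(height, width, blocks):
    """Return (channels, height, width, receptive_field) after each block."""
    rf, jump = 1, 1  # receptive field and cumulative stride, in input pixels
    rows = []
    for b in blocks:
        height = conv_out(height, b.kernel, b.stride, b.padding)
        width = conv_out(width, b.kernel, b.stride, b.padding)
        rf += (b.kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= b.stride
        rows.append((b.out_channels, height, width, rf))
    return rows


# Assumed configuration: 3x3 kernels, stride 2, padding 1, doubling channels.
blocks = [ConvBlock(c, 3, 2, 1) for c in (16, 32, 64, 128)]

# Assumed input: a 128 x 128 spectrogram-like feature map.
for channels, h, w, rf in trace(128, 128, blocks):
    print(f"channels={channels:4d}  map={h}x{w}  receptive_field={rf}")
```

Under these assumptions, each block doubles the channel count and halves each spatial dimension, so the feature map shrinks by a factor of four per block while the receptive field roughly doubles.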

Journal: CAAI Transactions on Intelligence Technology
Publication Date: May 26, 2023
Citations: 5
License type: CC BY 4.0
