Abstract

Artificial intelligence, deep learning, and machine learning are the dominant techniques for making systems smarter. A smart speech emotion recognition (SER) system is now a basic necessity and an emerging research area in digital audio signal processing, and SER plays an important role in many applications related to human–computer interaction (HCI). However, existing state-of-the-art SER systems have rather low prediction performance, which must improve before they become feasible for real-time commercial applications. The key reasons for the low accuracy and poor prediction rate are data scarcity and model configuration, which remain the most challenging aspects of building a robust machine learning technique. In this paper, we address the limitations of existing SER systems and propose a unique artificial intelligence (AI) based architecture for SER that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) units with sequence learning. We designed four ConvLSTM blocks, called local feature learning blocks (LFLBs), to extract local emotional features in a hierarchical correlation. The ConvLSTM layers adopt convolution operations for the input-to-state and state-to-state transitions, which lets them extract spatial cues. We stacked the four LFLBs, with a residual learning strategy, to extract spatiotemporal cues in hierarchically correlated form from the speech signals. Furthermore, we utilized a novel sequence learning strategy to extract global information and to adaptively adjust the relevant global feature weights according to the correlation of the input features. Finally, we used the center loss function together with the softmax loss to produce the class probabilities. The center loss improves the final classification results and ensures accurate prediction, playing a conspicuous role in the overall proposed SER scheme. We tested the proposed system on two standard speech corpora, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database and the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), and obtained recognition rates of 75% and 80%, respectively.
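The abstract describes LFLBs built from ConvLSTM layers whose input-to-state and state-to-state transitions are convolutions rather than matrix multiplications. Below is a minimal PyTorch sketch of one such block, assuming a 1-D formulation over speech feature maps; the layer sizes, kernel width, and pooling are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of one ConvLSTM-based local feature learning block (LFLB).
import torch
import torch.nn as nn

class ConvLSTMCell1d(nn.Module):
    """ConvLSTM cell: convolutions replace the matrix multiplications in the
    input-to-state and state-to-state transitions of a standard LSTM."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.hid_ch = hid_ch
        # One convolution produces all four gate pre-activations at once.
        self.gates = nn.Conv1d(in_ch + hid_ch, 4 * hid_ch, kernel,
                               padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # cell state keeps its spatial layout
        h = o * torch.tanh(c)
        return h, c

class LFLB(nn.Module):
    """One local feature learning block: a ConvLSTM unrolled over time,
    followed by pooling to shrink the local feature map."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.cell = ConvLSTMCell1d(in_ch, hid_ch)
        self.pool = nn.MaxPool1d(2)

    def forward(self, x):               # x: (batch, time, channels, width)
        b, t, _, w = x.shape
        h = x.new_zeros(b, self.cell.hid_ch, w)
        c = torch.zeros_like(h)
        outs = []
        for step in range(t):           # input-to-state / state-to-state steps
            h, c = self.cell(x[:, step], (h, c))
            outs.append(self.pool(h))
        return torch.stack(outs, dim=1)  # (batch, time, hid_ch, width // 2)
```

Four such blocks would be stacked (with residual connections, per the abstract) before the global sequence-learning stage.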

Highlights

  • Speech emotion recognition (SER) is an active area of research and an improved way of communicating in human–computer interaction (HCI)

  • The proposed model architecture consists of three modules: local feature learning blocks (LFLBs) built from ConvLSTM layers, a global feature learning block (GFLB) that uses gated recurrent units (GRUs), and a multi-class classification module that combines the center and softmax losses, together with a pre-processing stage (see the joint-loss sketch after this list)

  • We evaluated the proposed speech emotion recognition (SER) approach on standard benchmarks, including the Interactive Emotional Dyadic Motion Capture (IEMOCAP) [18] and the RAVDESS [19] speech corpora
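The classification module combines the center loss with the softmax (cross-entropy) loss. The following PyTorch sketch shows how such a joint objective is commonly assembled; `feat_dim`, `num_classes`, and the trade-off weight `lam` are assumptions for illustration, and here the class centers are learned directly by the optimizer, a common simplification of the original center-loss update rule.

```python
# Hedged sketch of a joint softmax + center loss objective.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Penalizes the squared distance between each embedding and the learned
    center of its class, pulling same-class features together."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats, labels):
        # 0.5 * ||x_i - c_{y_i}||^2, averaged over the batch.
        return 0.5 * ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

ce = nn.CrossEntropyLoss()                       # the softmax loss
center = CenterLoss(num_classes=8, feat_dim=128)  # assumed sizes
lam = 0.5                                        # assumed trade-off weight

def total_loss(logits, feats, labels):
    """Joint objective: cross-entropy plus weighted center loss."""
    return ce(logits, labels) + lam * center(feats, labels)
```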


Summary

Introduction

Speech emotion recognition (SER) is an active area of research and an improved way of communicating in human–computer interaction (HCI). Researchers have utilized modest end-to-end models for emotion recognition, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), long short-term memory (LSTM) networks, and deep belief networks (DBNs) [14,15]. These models extract high-level salient features from the speech signals and achieve a better recognition rate than low-level features [8,13]. Researchers have also utilized RNNs and LSTMs to learn long-term dependencies and recognize emotions. These techniques have not revealed any significant gain in accuracy, but they increase the computational cost and training time of the whole model [16].
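As a concrete illustration of the kind of end-to-end recurrent model the introduction refers to, the sketch below classifies emotions with an LSTM over MFCC frames in PyTorch; the librosa front end, feature sizes, file name, and emotion count are assumptions for illustration, not the paper's setup.

```python
# Minimal LSTM emotion-classification baseline over MFCC features.
import librosa
import torch
import torch.nn as nn

class LSTMEmotionBaseline(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_emotions=4):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x):            # x: (batch, frames, n_mfcc)
        _, (h, _) = self.lstm(x)     # last hidden state summarizes the clip
        return self.head(h[-1])      # emotion logits

# Usage: MFCCs are the low-level input; the LSTM learns higher-level cues.
wave, sr = librosa.load("clip.wav", sr=16000)          # hypothetical file
mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=40).T  # (frames, 40)
logits = LSTMEmotionBaseline()(torch.tensor(mfcc).unsqueeze(0))
```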

Related Works
The Proposed SER Framework
ConvLSTM in the SER Model
Our Model Configuration
Experimental Evaluation and Discussion
Conclusions and Future Direction