Abstract

Recognizing a speaker's emotional state is a difficult task for machine learning algorithms and plays an important role in speech emotion recognition (SER). SER is significant in many real-time applications, such as human behavior assessment, human-robot interaction, virtual reality, and emergency call centers, where the emotional state of speakers must be analyzed. Previous research in this field has mostly focused on handcrafted features and traditional convolutional neural network (CNN) models that extract high-level features from speech spectrograms to increase recognition accuracy, at the cost of overall model complexity. In contrast, we introduce a novel SER framework that selects key sequence segments using a radial basis function network (RBFN)-based similarity measure within clusters. The selected sequence is converted into a spectrogram with the short-time Fourier transform (STFT) and passed to a CNN model to extract discriminative and salient features from the speech spectrogram. We then normalize the CNN features to ensure precise recognition performance and feed them to a deep bidirectional long short-term memory (BiLSTM) network, which learns the temporal information needed to recognize the final emotional state. The proposed technique processes key segments instead of the whole utterance, reducing the computational complexity of the overall model, and normalizes the CNN features before further processing so that spatio-temporal information can be recognized more easily. The proposed system is evaluated on the standard IEMOCAP, EMO-DB, and RAVDESS datasets to assess recognition accuracy and processing time. Experiments demonstrate the robustness and effectiveness of the suggested SER model compared to state-of-the-art SER methods, achieving up to 72.25%, 85.57%, and 77.02% accuracy on the IEMOCAP, EMO-DB, and RAVDESS datasets, respectively.
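
To make the spectrogram-CNN-BiLSTM stage concrete, the following is a minimal PyTorch sketch of that part of the pipeline. The layer sizes, the LayerNorm-based feature normalization, and the input shape are illustrative assumptions, not values from the paper, and the RBFN-based key-segment selection step is omitted here.

```python
import torch
import torch.nn as nn

class SpectrogramCNNBiLSTM(nn.Module):
    """Sketch: CNN features from a speech spectrogram, per-frame
    normalization, then a BiLSTM over time to predict the emotion class."""
    def __init__(self, n_mels=128, hidden=128, n_emotions=7):
        super().__init__()
        # CNN block: extracts local time-frequency features from the spectrogram
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 64 * (n_mels // 4)           # channels x pooled frequency bins
        self.norm = nn.LayerNorm(feat_dim)       # normalize CNN features per frame
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, spec):                     # spec: (batch, 1, n_mels, time)
        f = self.cnn(spec)                       # (batch, 64, n_mels/4, time/4)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (batch, time/4, feat_dim)
        f = self.norm(f)
        out, _ = self.bilstm(f)                  # temporal modelling of frame features
        return self.classifier(out[:, -1])       # last time step -> emotion logits

# Example: a batch of 3-second mel-spectrograms (hypothetical shape)
model = SpectrogramCNNBiLSTM()
logits = model(torch.randn(4, 1, 128, 300))      # -> (4, n_emotions)
```
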

Highlights

  • Automatic recognition and identification of emotions from speech signals in speech emotion recognition (SER) using machine learning is a challenging task [1]

  • We propose a novel approach for SER to improve the recognition accuracy and reduce the overall computational cost and processing time of the model

  • We evaluated the proposed system on three standard datasets, IEMOCAP, EMO-DB, and RAVDESS, to check the robustness of the system

Introduction

Automatic recognition and identification of emotions from speech signals, i.e., speech emotion recognition (SER), is a challenging task for machine learning [1]. A major challenge researchers face is feature extraction, i.e., how to select a robust method for extracting salient and discriminative features from speech. Many researchers have investigated low-level handcrafted features for SER, such as energy, zero-crossing rate, pitch, linear predictor coefficients, and Mel-frequency cepstral coefficients (MFCCs), as well as nonlinear features such as the Teager energy operator. Most researchers apply deep learning techniques to SER using Mel-scale filter bank speech spectrograms as input features. A spectrogram is a 2-D representation of a speech signal that is widely used with convolutional neural networks (CNNs) to extract salient and discriminative features in SER [2] and other signal processing applications [3], [4]; 2-D CNNs are specially designed for visual recognition tasks [5]–[7].
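
For illustration, a Mel-scale log-spectrogram of the kind used as CNN input can be computed with librosa as sketched below. The file name, sampling rate, and STFT parameters are placeholder choices for the example, not settings taken from the paper.

```python
import numpy as np
import librosa

# Load a speech utterance and compute a Mel-scale log-spectrogram, the
# 2-D time-frequency representation typically fed to a 2-D CNN.
# "speech.wav" is a placeholder path; window and hop sizes are illustrative.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (128, n_frames)
```
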
