Speech Emotion Recognition System Research Articles

Artificial intelligence, machine learning, and deep learning are the dominant tools for making systems smarter. Smart speech emotion recognition (SER) is now a basic necessity and an emerging research area in digital audio signal processing, and it plays an important role in many applications related to human–computer interaction (HCI). Existing state-of-the-art SER systems have quite low prediction performance, which must improve before they are feasible for real-time commercial applications. The key reasons for the low accuracy and poor prediction rate are the scarcity of data and the difficulty of model configuration, which make building a robust machine learning technique challenging. In this paper, we address the limitations of existing SER systems and propose a unique artificial intelligence (AI) based system structure for SER that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) with sequence learning. We designed four ConvLSTM blocks, called local features learning blocks (LFLBs), to extract local emotional features in a hierarchical correlation. The ConvLSTM layers are adopted for the input-to-state and state-to-state transitions in order to extract spatial cues using convolution operations. We placed the four LFLBs so as to extract spatiotemporal cues in hierarchical correlational form from speech signals using a residual learning strategy. Furthermore, we utilized a novel sequence learning strategy to extract global information and adaptively adjust the relevant global feature weights according to the correlation of the input features. Finally, we used the center loss function together with the softmax loss to produce the class probabilities. The center loss improves the final classification results, ensures accurate prediction, and plays a conspicuous role in the whole proposed SER scheme. We tested the proposed system on two standard speech corpora, the interactive emotional dyadic motion capture (IEMOCAP) database and the Ryerson audio-visual database of emotional speech and song (RAVDESS), and obtained recognition rates of 75% and 80%, respectively.
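
The abstract does not give the exact form of the combined objective, so the following is only a minimal sketch of how a center loss is commonly paired with a softmax loss: the softmax cross-entropy is computed on the logits, and a penalty of half the squared distance between each feature vector and its class center is added with a weight. The function names, the weight `lam`, and its default value are assumptions for illustration, not details from the paper.

```python
import math

def softmax_cross_entropy(logits, label):
    """Softmax cross-entropy for one sample; logits is a list of floats."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def center_loss(feature, center):
    """Half the squared Euclidean distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def joint_loss(features, logits, labels, centers, lam=0.5):
    """Batch mean of softmax loss + lam * center loss.

    lam is a hypothetical weighting value, not taken from the abstract.
    centers[y] is the current center of class y in feature space.
    """
    total = 0.0
    for feat, z, y in zip(features, logits, labels):
        total += softmax_cross_entropy(z, y) + lam * center_loss(feat, centers[y])
    return total / len(labels)
```

In this form the softmax term separates the classes, while the center term pulls the features of each class toward a shared center; in training, the centers themselves would also be updated, which this sketch leaves out.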

Speech emotion recognition (SER) refers to the process of recognizing the emotional state of a speaker from a speech utterance. In earlier studies, a wide variety of cepstral features have been proposed for developing SER systems. The mel-frequency cepstral coefficients (MFCC) and human-factor cepstral coefficients (HFCC) are two popular variants of cepstral features. MFCC and HFCC features are extracted from speech signals using mel and human-factor filter banks, respectively. The magnitude response of each filter in these filter banks is triangular in shape. As a result, these filter banks are referred to as triangular filter banks (TFB), and the corresponding extracted cepstral coefficients can be denoted as TFBCC-M (in the case of MFCC) and TFBCC-HF (in the case of HFCC). The mel filter bank (TFB-M) is constructed using the mel scale, while the human-factor filter bank (TFB-HF) is constructed using the human-factor scale, which is a combination of the mel and equivalent rectangular bandwidth (ERB) scales. Similarly, different frequency scales can be used to realize different TFBs and so extract different types of TFBCC features. In this direction, this paper proposes two new TFBs, denoted TFB-B and TFB-E, realized using the bark and ERB scales to extract new cepstral features referred to as TFBCC-B and TFBCC-E, respectively. The mathematical background for constructing the proposed TFB-B and TFB-E is presented. The proposed filter banks are used along with the conventional TFB-M and TFB-HF to extract four different types of TFBCC features. These features are extracted from the emotional speech signals of two databases, namely the Berlin database of emotional speech (Emo-DB) and the Surrey audio-visual expressed emotion speech database (SAVEE). The extracted features are used to develop speaker-dependent (SD) and speaker-independent (SI) SER systems using support vector machines. The performance of the respective features is analyzed in terms of isolated and combined usage. The experimental results show that the cepstral features extracted using the proposed TFBs characterize and recognize emotions about as effectively as the conventional MFCC and HFCC features. Moreover, the combined use of different cepstral features improves the overall recognition performance of SER systems. For the Emo-DB database, isolated use of the proposed TFBCC-B and TFBCC-E features achieves recognition accuracies of 83.23% and 81.99% for the SD scenario, and 75% and 60.94% for the SI scenario, respectively. Similarly, for the SAVEE database, recognition accuracies of 75% and 66.67% for the SD scenario, and 44.17% and 55% for the SI scenario, are achieved. For the Emo-DB database, maximum recognition accuracies of 86.96% (for several combinations of conventional and proposed features, namely TFBCC-{(M+E), (M+B+E), (HF+B+E), (M+HF+B+E)}) and 77.08% (for the combination TFBCC-(M+B+E)) are achieved for the SD and SI scenarios, respectively. Similarly, for the SAVEE database, maximum recognition accuracies of 77.08% (for the combination TFBCC-(M+HF+E)) and 55.83% (for the combination TFBCC-(B+E)) are achieved for the SD and SI scenarios, respectively.
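
The abstract states that different frequency scales yield different triangular filter banks but does not reproduce the construction. As a rough sketch only, the usual recipe is to space the filter edges uniformly on the warped scale, map them back to Hz, and build one triangle per filter over the FFT bins. The warping formulas below are commonly used closed forms for the mel, bark, and ERB-rate scales and may differ from the ones the paper adopts; the function and parameter names are hypothetical, and the human-factor scale is not covered.

```python
import math

# Commonly used closed-form warpings (Hz -> scale units) and their inverses.
# These are assumptions; the paper may use other approximations.
SCALES = {
    "mel": (lambda f: 2595.0 * math.log10(1.0 + f / 700.0),
            lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)),
    "bark": (lambda f: 26.81 * f / (1960.0 + f) - 0.53,
             lambda z: 1960.0 * (z + 0.53) / (26.28 - z)),
    "erb": (lambda f: 21.4 * math.log10(1.0 + 0.00437 * f),
            lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437),
}

def triangular_filter_bank(scale, n_filters, n_fft, sample_rate):
    """Return n_filters rows of n_fft // 2 + 1 weights, one triangle per row."""
    to_scale, to_hz = SCALES[scale]
    lo, hi = to_scale(0.0), to_scale(sample_rate / 2.0)
    # n_filters + 2 edge points, equally spaced on the warped scale.
    edges_hz = [to_hz(lo + (hi - lo) * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        row = []
        for f in bin_hz:
            if left < f <= center:      # rising slope up to the peak
                row.append((f - left) / (center - left))
            elif center < f < right:    # falling slope after the peak
                row.append((right - f) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank
```

For example, `triangular_filter_bank("bark", 20, 512, 16000)` gives 20 filters over 257 FFT bins. Cepstral coefficients would then follow the usual MFCC-style steps (filter-bank energies, logarithm, discrete cosine transform), which are omitted here.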

Articles published on Speech Emotion Recognition System

Autoencoder With Emotion Embedding for Speech Emotion Recognition

1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features

Mental Illness Disorder Diagnosis Using Emotion Variation Detection from Continuous English Speech

PBL English micro-audio and video teaching model based on data mining algorithm

The Emotion Recognition System Based on Support Vector Machines

Emotional Speech Recognition using Deep Learning

CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network

Detection of interactive voice response (IVR) in phone call records

MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach

Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques

A modified feature selection method based on metaheuristic algorithms for speech emotion recognition

Improving Speech Emotion Recognition With Adversarial Data Augmentation Network

Unsupervised feature selection and NMF de-noising for robust Speech Emotion Recognition

Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features

Investigation of multilingual and mixed-lingual emotion recognition using enhanced cues with data augmentation

Emotion Recognition of Manipuri Speech using Convolution Neural Network

Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features

Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and ERB frequency scales

Semi-Supervised Speech Emotion Recognition With Ladder Networks

End-to-End Speech Emotion Recognition With Gender Information
