Emotional Speech Database Research Articles

Speech emotion recognition (SER) can be applied to distance education to analyze speech patterns and evaluate speakers’ emotional states in real time. It can enhance students’ learning experiences by enabling the assessment of their instructors’ emotional stability, a factor that significantly impacts the effectiveness of information delivery. Students demonstrate different engagement levels during learning activities, and assessing this engagement is important for controlling the learning process and improving e-learning systems. One factor that may influence student engagement is the instructor’s emotional state. Accordingly, this study used deep learning techniques to create an automated system that recognizes instructors’ emotions from their speech while they deliver distance learning. The methodology integrated transformer, convolutional neural network, and long short-term memory architectures into an ensemble to enhance SER. Features were extracted from the audio data using Mel-frequency cepstral coefficients; chroma; a Mel spectrogram; the zero-crossing rate; spectral contrast, centroid, bandwidth, and roll-off; and the root-mean-square energy. The audio data were also augmented by adding noise, time stretching, and shifting. Several transformer blocks were incorporated, and a multi-head self-attention mechanism was employed to identify the relationships between segments of the input sequence. The preprocessing and data augmentation significantly improved the results, with accuracy rates of 96.3%, 99.86%, 96.5%, and 85.3% for the Ryerson Audio–Visual Database of Emotional Speech and Song, Berlin Database of Emotional Speech, Surrey Audio–Visual Expressed Emotion, and Interactive Emotional Dyadic Motion Capture datasets, respectively. Furthermore, the model achieved 83% accuracy on another dataset created for this study, the Saudi Higher-Education Instructor Emotions dataset. The results demonstrate the considerable accuracy of this model in detecting emotions in speech data across different languages and datasets.
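The augmentation steps and two of the simpler features named in this abstract (zero-crossing rate and root-mean-square energy) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names, the default noise level, and the parameter choices are assumptions made for illustration, and the time stretch shown is a crude linear-interpolation resampler that also shifts pitch, unlike the phase-vocoder stretch offered by audio libraries.

```python
import math
import random


def add_noise(signal, noise_level=0.005, seed=0):
    """Add zero-mean Gaussian noise scaled to the signal's peak amplitude.

    noise_level and seed are illustrative defaults, not values from the study.
    """
    rng = random.Random(seed)
    peak = max(abs(s) for s in signal)
    return [s + rng.gauss(0.0, noise_level * peak) for s in signal]


def shift(signal, offset):
    """Circularly shift the signal by `offset` samples (positive = later)."""
    offset %= len(signal)
    if offset == 0:
        return list(signal)
    return signal[-offset:] + signal[:-offset]


def time_stretch(signal, rate):
    """Crude stretch by linear-interpolation resampling; rate > 1 shortens.

    Assumption: a stand-in for a proper time stretch. It changes pitch too.
    """
    last = len(signal) - 1
    n_out = max(1, int(round(len(signal) / rate)))
    out = []
    for i in range(n_out):
        pos = min(i * rate, last)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * signal[lo] + frac * signal[hi])
    return out


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms(frame):
    """Root-mean-square energy of a frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


# Usage: one second of a 440 Hz tone at an assumed 8 kHz sampling rate.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
augmented = [add_noise(tone), shift(tone, 800), time_stretch(tone, 1.1)]
features = [(zero_crossing_rate(x), rms(x)) for x in [tone] + augmented]
```

Each augmented copy keeps the original label, so a training set grows without new recordings; the spectral features listed in the abstract (MFCCs, chroma, Mel spectrogram, and so on) would normally come from an audio library rather than hand-written code.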

Speech emotion recognition (SER) is an important research problem in human‐computer interaction systems. The representation and extraction of features are significant challenges in SER systems. Despite their promising results, recent studies generally do not leverage progressive fusion techniques for effective feature representation and for increasing receptive fields. To mitigate this problem, this article proposes DeepCNN, which fuses the spectral and temporal features of emotional speech by running convolutional neural networks (CNNs) in parallel with a convolution layer‐based transformer. Two parallel CNNs extract the spectral (2D‐CNN) and temporal (1D‐CNN) feature representations. A 2D‐convolution layer‐based transformer module extracts spectro‐temporal features and concatenates them with the features from the parallel CNNs. The learnt low‐level concatenated features are then passed to a deep framework of convolutional blocks, which retrieves a high‐level feature representation, and the emotional states are subsequently categorised using an attention gated recurrent unit and a classification layer. This fusion technique yields a deeper hierarchical feature representation at a lower computational cost while simultaneously expanding the filter depth and reducing the feature map. The Berlin Database of Emotional Speech (EMO‐BD) and Interactive Emotional Dyadic Motion Capture (IEMOCAP) datasets are used in experiments to recognise distinct speech emotions. With efficient spectral and temporal feature representation, the proposed SER model achieves 94.2% accuracy on the EMO‐BD dataset and 81.1% accuracy on the IEMOCAP dataset. The proposed SER system, DeepCNN, outperforms the baseline SER systems in terms of emotion recognition accuracy on both datasets.
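The idea of fusing parallel branches by concatenation can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions, not the DeepCNN architecture: the kernels are fixed rather than learnt, the transformer branch is omitted, both branches are assumed to read the same small spectrogram (a list of frequency bands, each a list of frame values), and all function names are hypothetical.

```python
def conv1d_valid(sequence, kernel):
    """1D cross-correlation without padding, as used in CNN layers."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def conv2d_valid(image, kernel):
    """2D cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(cols - kw + 1)
        ]
        for r in range(rows - kh + 1)
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def fuse(spectrogram, temporal_kernel, spectral_kernel):
    """Run a temporal (1D) and a spectral (2D) branch, then concatenate.

    Assumption: the temporal branch convolves along time within each band;
    the source abstract does not specify the inputs of each branch.
    """
    temporal = [
        v
        for band in spectrogram
        for v in relu(conv1d_valid(band, temporal_kernel))
    ]
    spectral = [
        v
        for row in conv2d_valid(spectrogram, spectral_kernel)
        for v in relu(row)
    ]
    return temporal + spectral  # concatenated low-level feature vector


# Usage: 3 frequency bands x 5 time frames, with hand-picked toy kernels.
toy_spectrogram = [
    [1, 2, 3, 4, 5],
    [0, 1, 0, 1, 0],
    [2, 2, 2, 2, 2],
]
fused = fuse(toy_spectrogram, [-1, 1], [[1, 0], [0, 1]])
```

In a real model the concatenated vector would feed further convolutional blocks and a classifier; here it simply shows that the two branches contribute separate segments of one feature vector.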

Articles published on Emotional Speech Database

Optimized Multimodal Emotional Recognition Using Long Short-Term Memory

Scalability and diversity of StarGANv2-VC in Arabic emotional voice conversion: Overcoming data limitations and enhancing performance

Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms

Enhancing speech emotion recognition with deep learning using multi-feature stacking and data augmentation

Emotions recognition in audio signals using an extension of the latent block model

Convolutional Neural Network Architectures for Gender, Emotional Detection from Speech and Speaker Diarization

Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets

Speech Emotion Recognition using Extreme Machine Learning

A Feature Selection Algorithm Based on Differential Evolution for English Speech Emotion Recognition

English Speech Emotion Classification Based on Multi-Objective Differential Evolution

Comparison of Various Feature Selection Algorithms in Speech Emotion Recognition

Spoken emotion recognition through human-computer interaction using a novel deep learning technology

The Dysarthric Expressed Emotional Database (DEED): An audio-visual database in British English

Speech emotion recognition with light gradient boosting decision trees machine

Enhancing Human-Machine Interaction: Real-Time Emotion Recognition through Speech Analysis

To Design and Develop Advance Speech Emotion Recognition using MLP Classifier with Evolutionary LIBROSA Library

SER: Performance Evaluation of CNN Model Along with an Overview of Available Indic Speech Datasets, and Transition of Classifiers From Traditional to Modern Era

Speech emotion recognition using multiple classification models based on MFCC feature values

DeepCNN: Spectro‐temporal feature representation for speech emotion recognition

Artificial Intelligent for Human Emotion Detection with the Mel-Frequency Cepstral Coefficient (MFCC)
