Emotional Speech Database Research Articles

In recent years, speech recognition technology has become increasingly common. Speech quality and intelligibility are critical for convenient and accurate information transmission in speech recognition. Speech processing systems used to transmit or store speech are usually designed for an environment without background noise. In real-world environments, however, interference in the form of background noise and channel noise drastically reduces the performance of speech recognition systems, resulting in imprecise information transfer and fatiguing the listener. When the input or output signals of a communication system are affected by noise, speech enhancement techniques try to improve its performance. To ensure the correctness of the text produced from speech, it is necessary to reduce the external noise in the speech audio. Reducing this noise is difficult because the speech can consist of isolated words, continuous speech, or spontaneous speech. In automatic speech recognition, various typical speech enhancement algorithms are available and have gained considerable attention. However, these algorithms work well only on simple, continuous audio signals. Thus, in this study, a hybridized speech recognition algorithm is proposed to improve speech recognition accuracy. Non-linear spectral subtraction, a well-known speech enhancement algorithm, is optimized with the hidden Markov model and tested with 6660 medical speech transcription audio files and 1440 Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio files. The performance of the proposed model is compared with that of various typical speech enhancement algorithms, such as the iterative signal enhancement algorithm, subspace-based speech enhancement, and non-linear spectral subtraction. The proposed cascaded hybrid algorithm was found to achieve a minimum word error rate of 9.5% and 7.6% for medical speech and RAVDESS speech, respectively. Cascading the speech enhancement and speech-to-text conversion architectures results in higher speech recognition accuracy. The evaluation results support incorporating the proposed method into real-time automatic speech recognition applications in medicine, where the terminology involved is highly complex.
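
The abstract above reports its results as word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch in Python that computes WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the abstract does not say how the authors tokenized or scored their transcripts.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of five reference words.
    print(word_error_rate("the patient has a fever", "the patient has fever"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.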

Recent research on speech emotion recognition (SER) has made considerable advances through the use of Mel-frequency cepstral coefficient (MFCC) spectrogram features and neural network approaches such as convolutional neural networks (CNNs). The fundamental issue with CNNs is that they do not capture the spatial information in spectrograms. Capsule networks (CapsNet) have gained recognition as alternatives to CNNs because of their larger capacity for hierarchical representation. However, a less obvious issue with CapsNet is that the compression method employed in CNNs cannot be directly applied to it. To address these issues, this research introduces a novel text-independent and speaker-independent SER architecture, in which a dual-channel long short-term memory compressed-CapsNet (DC-LSTM COMP-CapsNet) algorithm is proposed based on the structural features of CapsNet. The proposed classifier provides energy efficiency and an adequate compression method for speech emotion recognition, neither of which the original CapsNet structure delivers. Moreover, the grid search (GS) approach is used to attain optimal solutions. The results show improved performance and reduced training and testing time. The speech datasets used to evaluate the algorithm are the Arabic Emirati-accented corpus, the English “speech under simulated and actual stress (SUSAS)” corpus, the English Ryerson audio-visual database of emotional speech and song (RAVDESS) corpus, and the crowd-sourced emotional multimodal actors dataset (CREMA-D). This work finds that MFCC delta-delta features outperform the other known feature extraction methods. Using the four datasets and the MFCC delta-delta features, DC-LSTM COMP-CapsNet surpasses all the state-of-the-art systems, classical classifiers, CNN, and the original CapsNet. On the Arabic Emirati-accented corpus, the proposed work yields an average emotion recognition accuracy of 89.3%, compared to 84.7%, 82.2%, 69.8%, 69.2%, 53.8%, 42.6%, and 31.9% for CapsNet, CNN, support vector machine (SVM), multi-layer perceptron (MLP), k-nearest neighbor (KNN), radial basis function (RBF), and naïve Bayes (NB), respectively.
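
The abstract above singles out MFCC delta-delta features. As an illustration of what such features are, the following Python sketch computes delta (first-order) and delta-delta (second-order) coefficients from a sequence of per-frame feature vectors using the common regression formula, with edge frames repeated at the boundaries. The function names, the window half-width of 2, and the edge handling are assumptions chosen for the example; the abstract does not specify how the authors computed their features.

```python
from typing import List

Frames = List[List[float]]  # one inner list of coefficients per time frame


def delta(frames: Frames, width: int = 2) -> Frames:
    """Regression-based delta coefficients.

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..width} n^2)

    Assumption: frames beyond either end are replaced by the nearest
    valid frame (edge padding).
    """
    n_frames = len(frames)
    if n_frames == 0:
        return []
    denom = 2 * sum(n * n for n in range(1, width + 1))

    def frame_at(t: int) -> List[float]:
        # Clamp the index so out-of-range frames repeat the edge frame.
        return frames[min(max(t, 0), n_frames - 1)]

    result: Frames = []
    for t in range(n_frames):
        row = []
        for k in range(len(frames[t])):
            num = sum(
                n * (frame_at(t + n)[k] - frame_at(t - n)[k])
                for n in range(1, width + 1)
            )
            row.append(num / denom)
        result.append(row)
    return result


def delta_delta(frames: Frames, width: int = 2) -> Frames:
    """Second-order coefficients: the delta of the delta sequence."""
    return delta(delta(frames, width), width)


if __name__ == "__main__":
    # A single coefficient that rises by 1 per frame.
    ramp = [[float(t)] for t in range(7)]
    print(delta(ramp)[3], delta_delta(ramp)[3])
```

In a typical pipeline the static coefficients, deltas, and delta-deltas are concatenated frame by frame before being passed to a classifier, so the model sees both the spectral shape and how it changes over time.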

Articles published on Emotional Speech Database

Speech emotion recognition based on genetic algorithm–decision tree fusion of deep and acoustic features

A Review on Emotional Speech Databases

A Novel Classification Method with Cubic Spline Interpolation

A Novel S-LDA Features for Automatic Emotion Recognition from Speech using 1-D CNN

A Perspective Study on Speech Emotion Recognition: Databases, Features and Classification Models

Punjabi Emotional Speech Database: Design, Recording and Verification

Emotional voice conversion: Theory, databases and ESD

Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody

Effect on speech emotion classification of a feature selection approach using a convolutional neural network.

The Mexican Emotional Speech Database (MESD): elaboration and assessment based on machine learning.

A Hybrid Speech Enhancement Algorithm for Voice Assistance Application.

Novel dual-channel long short-term memory compressed capsule networks for emotion recognition

Multi-Modal Residual Perceptron Network for Audio-Video Emotion Recognition.

Feature compensation based on the normalization of vocal tract length for the improvement of emotion-affected speech recognition

Attention guided 3D CNN-LSTM model for accurate speech based emotion recognition

Novel hybrid DNN approaches for speaker verification in emotional and stressful talking environments

A new proposed statistical feature extraction method in speech emotion recognition

Feature Specific Hybrid Framework on composition of Deep learning architecture for speech emotion recognition

Speech Emotion Recognition System

When Old Meets New: Emotion Recognition from Speech Signals
