Abstract

Speech is one of the most natural communication channels for expressing human emotions. Therefore, speech emotion recognition (SER) has been an active area of research with an extensive range of applications in several domains, such as biomedical diagnostics in healthcare and human–machine interaction. Recent work in SER has focused on end-to-end deep neural networks (DNNs). However, the scarcity of emotion-labeled speech datasets inhibits the full potential of training a deep network from scratch. In this paper, we propose new approaches for classifying emotions from speech by combining conventional mel-frequency cepstral coefficients (MFCCs) with image features extracted from spectrograms by a pretrained convolutional neural network (CNN). Unlike prior studies that employ end-to-end DNNs, our methods eliminate the resource-intensive network training process. Using the best prediction model obtained, we also build an SER application that predicts emotions in real time. Among the proposed methods, the hybrid feature set fed into a support vector machine (SVM) achieves an accuracy of 0.713 on a 6-class prediction problem evaluated on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), higher than previously published results. Interestingly, MFCCs taken as the sole input to a long short-term memory (LSTM) network achieve a slightly higher accuracy of 0.735. Our results show that the proposed approaches improve prediction accuracy. The empirical findings also demonstrate the effectiveness of using a pretrained CNN as an automatic feature extractor for emotion prediction. Moreover, the success of the MFCC-LSTM model is evidence that, despite being conventional features, MFCCs can still outperform more sophisticated deep-learning feature sets.
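
To make the MFCC-LSTM setup concrete, the sketch below shows one plausible way to feed MFCC frame sequences into an LSTM classifier for a six-class prediction problem. It is a minimal illustration rather than the authors' implementation: the number of coefficients, the fixed sequence length, the layer sizes, and the use of librosa and Keras are all assumptions made here for demonstration.

```python
# Minimal sketch: MFCC sequences -> LSTM emotion classifier.
# Hyperparameters and architecture are assumptions, not the paper's exact setup.
import numpy as np
import librosa
from tensorflow.keras import layers, models

N_MFCC = 40          # number of MFCC coefficients per frame (assumption)
MAX_FRAMES = 300     # fixed sequence length after padding/truncation (assumption)
N_CLASSES = 6        # six emotion classes, as in the abstract

def mfcc_sequence(path, sr=22050):
    """Load a clip and return a (MAX_FRAMES, N_MFCC) MFCC sequence."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T   # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

model = models.Sequential([
    layers.Input(shape=(MAX_FRAMES, N_MFCC)),
    layers.LSTM(128),                          # single LSTM layer (assumption)
    layers.Dense(64, activation="relu"),
    layers.Dense(N_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, ...) where X_train has shape (n, MAX_FRAMES, N_MFCC)
```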

Highlights

  • Sentiment analysis and affective computing have been receiving a surge of interest from both the academic and business communities in recent years due to the proliferation of opinion-rich social media data and their increasing applications in different use cases.

  • We investigated the use of a hybrid feature set to classify emotions from speech: a fusion of mel-frequency cepstral coefficients (MFCCs) and deep-learned features extracted from images depicting speech spectrograms by a pretrained convolutional neural network (CNN) model, namely ResNet50.

  • We investigated the fusion of speech MFCCs and image features extracted from signal spectrograms for the task of emotion recognition; an illustrative sketch of such a feature fusion follows this list.
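
As a rough illustration of the hybrid feature set described in the highlights, the sketch below concatenates clip-level MFCC statistics with ResNet50 embeddings of a mel-spectrogram image and feeds the result to an SVM. The spectrogram rendering, image size, pooling choice, and MFCC summarization are assumptions made for illustration only; only the general idea of reusing a pretrained CNN as a fixed feature extractor is taken from the paper.

```python
# Minimal sketch: hybrid MFCC + ResNet50 spectrogram features -> SVM.
# Image size, pooling, and feature summarization are assumptions, not the published pipeline.
import numpy as np
import librosa
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

cnn = ResNet50(weights="imagenet", include_top=False, pooling="avg")  # 2048-d embeddings

def spectrogram_image(y, sr, size=224):
    """Render a mel spectrogram as a 3-channel 'image' for the pretrained CNN."""
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr), ref=np.max)
    S = (S - S.min()) / (S.max() - S.min() + 1e-8) * 255.0           # scale to 0-255
    # naive nearest-neighbour resize to size x size (assumption; any resize works)
    rows = np.linspace(0, S.shape[0] - 1, size).astype(int)
    cols = np.linspace(0, S.shape[1] - 1, size).astype(int)
    img = S[np.ix_(rows, cols)]
    return np.stack([img, img, img], axis=-1)                        # (size, size, 3)

def hybrid_features(path):
    """Concatenate MFCC mean/std statistics with the CNN spectrogram embedding."""
    y, sr = librosa.load(path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    mfcc_stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # 80-d
    img = preprocess_input(spectrogram_image(y, sr)[np.newaxis])
    deep = cnn.predict(img, verbose=0)[0]                                # 2048-d
    return np.concatenate([mfcc_stats, deep])

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
# clf.fit(np.stack([hybrid_features(p) for p in train_paths]), train_labels)
```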

Summary

Introduction

Sentiment analysis and affective computing have been receiving a surge of interest from both the academic and business communities in recent years due to the proliferation of opinion-rich social media data and their increasing applications in different use cases. Research in this field has traditionally focused on analyzing textual data. Finding effective features and training machine learning models that generalize well to real-world applications remain challenging tasks. These challenges, along with the emergence of speech-based virtual assistants that provide readily available platforms for voice-based emotion recognition systems [3], drive a growing interest in the speech emotion recognition (SER) body of research [2].
