Abstract

Speech is the most significant mode of communication among human beings and a potential method for human-computer interaction (HCI) using a microphone sensor. Quantifiable emotion recognition from speech signals captured by such sensors is an emerging area of research in HCI, with applications such as human-robot interaction, virtual reality, behavior assessment, healthcare, and emergency call centers, where the speaker's emotional state must be determined from an individual's speech. In this paper, we present two major contributions: (i) increasing the accuracy of speech emotion recognition (SER) compared to the state of the art and (ii) reducing the computational complexity of the presented SER model. We propose an artificial intelligence-assisted deep stride convolutional neural network (DSCNN) architecture using the plain-nets strategy to learn salient and discriminative features from spectrograms of speech signals that are enhanced in prior steps to improve performance. Local hidden patterns are learned in convolutional layers that use special strides, rather than pooling layers, to down-sample the feature maps, and global discriminative features are learned in fully connected layers. A softmax classifier is used for the classification of emotions in speech. The proposed technique is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) datasets, improving accuracy by 7.85% and 4.5%, respectively, while reducing the model size by 34.5 MB. This demonstrates the effectiveness and significance of the proposed SER technique and its applicability to real-world applications.
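The abstract's key architectural idea is that a convolution with stride greater than one downsamples feature maps on its own, removing the need for a separate pooling layer. A minimal numpy sketch of this (the 64x64 input, 3x3 kernel, and stride values are illustrative, not the paper's settings):

```python
import numpy as np

def conv2d(x, kernel, stride=1):
    """Valid 2D convolution with a configurable stride.

    With stride > 1, the convolution itself downsamples the feature map,
    which is the role a pooling layer would otherwise play.
    """
    kh, kw = kernel.shape
    out_h = (x.shape[0] - kh) // stride + 1
    out_w = (x.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Toy 64x64 "spectrogram" and a 3x3 averaging kernel.
spec = np.random.rand(64, 64)
kernel = np.ones((3, 3)) / 9.0

full_res = conv2d(spec, kernel, stride=1)  # shape (62, 62)
strided = conv2d(spec, kernel, stride=2)   # shape (31, 31): downsampled in one step
print(full_res.shape, strided.shape)
```

Stacking such strided layers progressively shrinks the spatial dimensions while learning local patterns, which is why the resulting model can be smaller than a conventional conv-plus-pooling stack.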

Highlights

  • Speech emotion recognition (SER) is the most natural and fastest way for humans and computers to exchange information, and it plays an important role in real-time applications of human-machine interaction

  • The Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset contains 12 hours of audiovisual data divided into five sessions; in each session, two actors record scripted material expressing multiple emotions

  • The paper builds on convolutional neural network (CNN) architectures [14] to develop a deep stride convolutional neural network (DSCNN) model for SER and performs experiments on utterance-based speech spectrograms generated from speech signals
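The highlights mention utterance-based spectrograms generated from speech signals as the model's input. A minimal sketch of how a magnitude spectrogram can be computed with a short-time Fourier transform (the frame length, hop size, and sample rate below are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform.

    frame_len and hop are illustrative defaults, not the paper's parameters.
    Returns an array of shape (frequency bins, time frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)
    ])
    # One-sided FFT of each frame, then transpose to bins x frames.
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Toy one-second "utterance" at 16 kHz: a 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(sig)
print(spec.shape)  # (129, 124): frame_len // 2 + 1 bins x n_frames
```

The resulting 2D time-frequency image is what a CNN such as the DSCNN consumes in place of the raw waveform.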


Introduction

Speech emotion recognition (SER) is the most natural and fastest way for humans and computers to exchange information, and it plays an important role in real-time applications of human-machine interaction. Many researchers are working in this domain to make machines intelligent enough to analyze and identify the emotional condition of a speaker from an individual's speech. Researchers are trying to find robust and salient features for SER using artificial intelligence and deep learning approaches [3] to extract hidden information, and using CNN features to train different CNN models [4,5] to increase the performance and decrease the computational complexity of SER for human behavior assessment. SER faces many challenges and limitations due to the vast number of social media users and the low cost and high bandwidth of the Internet; the usage of low-cost internet and social media gives rise to a semantic gap. To cover the semantic gap in this area,

Sensors 2020, 20, 183; doi:10.3390/s20010183 www.mdpi.com/journal/sensors
