Abstract

Robust automatic emotional speech recognition architectures based on hybrid convolutional neural networks (CNNs) and feedforward deep neural networks are proposed in this paper and named BFN, CNA, and HBN. BFN combines bag-of-audio-words (BoAW) features with a feedforward deep neural network, CNA is based on a CNN, and HBN is a hybrid architecture combining BFN and CNA. High overall accuracy is achieved by feeding the networks with Mel-frequency cepstral coefficient (MFCC) features and bag-of-acoustic-words representations, resulting in promising classification performance. In addition, the concatenated output of the proposed hybrid network is fed into a softmax layer to produce a probability distribution over the categorical emotion classes. The three proposed models are trained on eight emotional classes from the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) audio dataset. Our proposed models achieve an overall precision between 81.5% and 85.5% and an overall accuracy between 80.6% and 84.5%, outperforming state-of-the-art models on the same dataset.
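
To make the hybrid idea concrete, the following is a minimal PyTorch sketch, not the paper's exact architecture, of an HBN-style model: a CNN branch over MFCC feature maps and a feedforward branch over a bag-of-acoustic-words histogram are concatenated, and a softmax layer produces a probability distribution over the eight RAVDESS emotion classes. All layer sizes, input shapes, and names (e.g., HybridEmotionNet) are illustrative assumptions.

```python
# Hedged sketch of an HBN-style hybrid network: CNN branch + feedforward
# branch, concatenated and passed to a softmax classification layer.
# Shapes and layer widths are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class HybridEmotionNet(nn.Module):
    def __init__(self, n_mfcc=40, n_frames=200, boaw_dim=500, n_classes=8):
        super().__init__()
        # CNN branch ("CNA"-style): treats the MFCC matrix as a 1-channel image
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        cnn_out = 32 * (n_mfcc // 4) * (n_frames // 4)
        # Feedforward branch ("BFN"-style): operates on the BoAW histogram
        self.ffn = nn.Sequential(
            nn.Linear(boaw_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        # Concatenated branch outputs feed one softmax classification layer
        self.classifier = nn.Linear(cnn_out + 128, n_classes)

    def forward(self, mfcc, boaw):
        z = torch.cat([self.cnn(mfcc), self.ffn(boaw)], dim=1)
        return torch.softmax(self.classifier(z), dim=1)

# Example forward pass with random stand-in tensors
model = HybridEmotionNet()
probs = model(torch.randn(4, 1, 40, 200), torch.randn(4, 500))
print(probs.shape)  # torch.Size([4, 8]); each row sums to 1
```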

Highlights

  • Accurate emotion recognition from speech and song files remains a challenging issue

  • This work extends our previous paper [56], which evaluated shallow models (SVM, k-nearest neighbor (KNN), and XGBoost) using MFCC feature extraction followed by a BoW output vector as the input to each classifier (a rough sketch of that pipeline appears after this list)

  • BFN, CNA, and HBN were presented for extracting emotional classes from acoustic signals based on deep learning (DL) techniques
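
As referenced in the highlights above, the shallow-model pipeline from [56] can be sketched roughly as follows: per-frame MFCCs are quantized against a learned codebook to form a bag-of-words histogram that feeds a conventional classifier (an SVM in this sketch). The codebook size, MFCC settings, and helper names are assumptions, not the configuration reported in that paper.

```python
# Hedged sketch of an MFCC -> BoW -> shallow-classifier pipeline.
# librosa/scikit-learn usage is illustrative; parameters are assumptions.
import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def mfcc_frames(path, sr=16000, n_mfcc=40):
    """Return per-frame MFCC vectors (frames x n_mfcc) for one audio file."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def boaw_histogram(frames, codebook):
    """Quantize frames against the codebook and return a normalized histogram."""
    words = codebook.predict(frames)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def train_shallow_model(train_paths, train_labels, codebook_size=500):
    # train_paths / train_labels are placeholders for RAVDESS file lists and labels
    all_frames = np.vstack([mfcc_frames(p) for p in train_paths])
    codebook = KMeans(n_clusters=codebook_size, n_init=10).fit(all_frames)
    X = np.array([boaw_histogram(mfcc_frames(p), codebook) for p in train_paths])
    clf = SVC(kernel="rbf").fit(X, train_labels)
    return codebook, clf
```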


Summary

INTRODUCTION

Accurate emotion recognition from speech and song files remains a challenging issue. Deep learning (DL) has shown substantial promise in many applications such as social network analysis [1], encryption and decryption [2], forensics [3], and automotive work [4]. Studies such as [5] investigate the exponential stability analysis of Markovian neural networks (MNNs), which can be used to improve many engineering fields, such as communication systems, power systems, production systems, and network control systems. Three new speech emotion recognition architectures are introduced, based on feedforward networks with BoAW, CNNs, and hybrid networks.

PRIOR RESEARCH
MODIFIED SHALLOW MODELS
EXTREME GRADIENT BOOSTING CLASSIFIER
EXPERIMENTAL SETUP AND SIMULATION RESULTS
Findings
CONCLUSION AND FUTURE WORK