Abstract

In this study, we present a deep learning-based approach to speech emotion recognition (SER). The system combines a deep convolutional neural network (DCNN) and a bidirectional long short-term memory (BLSTM) network with a time-distributed flatten (TDF) layer. The proposed model was applied to the recently built audio-only Bangla emotional speech corpus SUBESCO. A series of experiments was carried out to analyze all the models discussed in this paper under baseline, cross-lingual, and multilingual training-testing setups. The experimental results reveal that the model with a TDF layer, which can operate on both temporal and sequential representations of emotions, outperforms other state-of-the-art CNN-based SER models. For the cross-lingual experiments, cross-corpus training, multi-corpus training, and transfer learning were employed for the Bangla and English languages using the SUBESCO and RAVDESS datasets. The proposed model attained state-of-the-art performance, achieving weighted accuracies (WAs) of 86.9% and 82.7% on the SUBESCO and RAVDESS datasets, respectively.
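To make the architecture concrete, the following is a minimal Keras sketch of a DCNN front end followed by a time-distributed flatten layer and a BLSTM, as described above. The layer counts, filter sizes, and input shape (time frames by mel bands) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a DCNN + TDF + BLSTM stack for SER.
# Hyperparameters below are assumptions for illustration only.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 7  # SUBESCO covers seven emotion classes; adjust for other corpora

def build_ser_model(time_steps=128, mel_bands=64):
    # Input: a log-mel spectrogram treated as a single-channel image
    inputs = layers.Input(shape=(time_steps, mel_bands, 1))
    x = inputs
    # Deep CNN front end: pool only along the frequency axis so the
    # time dimension survives for the recurrent stage
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    # Time-distributed flatten: collapse (frequency, channels) per frame,
    # yielding one feature vector per time step
    x = layers.TimeDistributed(layers.Flatten())(x)
    # BLSTM models the resulting frame sequence in both directions
    x = layers.Bidirectional(layers.LSTM(128))(x)
    outputs = layers.Dense(NUM_EMOTIONS, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_ser_model()
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

The key design choice the sketch highlights is pooling only across frequency in the CNN stage: this preserves the temporal axis so that the TDF layer can hand a genuine sequence of frame-level features to the BLSTM.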

Highlights

  • Identifying human emotions from voice signals, using a machine learning approach, is important for constructing a natural human-computer interaction (HCI) system

  • The weighted accuracy (WA) achieved by this model is 86.86%, and the average F1 score is 86.86%

  • In comparison with the other models tested here, we found that the seven-layer convolutional neural network (CNN) architectures provided comparable performance at a fraction of the training time needed by other architectures



Introduction

Identifying human emotions from voice signals, using a machine learning approach, is important for constructing a natural human-computer interaction (HCI) system. Selecting the appropriate features for classifying emotions accurately is the most crucial design decision. A wide variety of acoustic features have been employed to construct SER systems in various studies [3]. These features can be further classified as temporal (time-domain) and spectral (frequency-domain) features; a short extraction example is sketched below. The final result of SER is obtained by the use of a classifier, which allows the system to determine the best match for the input emotional speech. Hidden Markov Models (HMMs), Support Vector Machines (SVMs), Gaussian Mixture Models (GMMs), Artificial Neural Networks (ANNs), decision trees, and ensemble approaches are some well-known classifiers that have been employed in previous studies.
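As a concrete illustration of spectral feature extraction, the following librosa sketch computes MFCCs from a speech file. The paper does not prescribe this exact recipe; the sampling rate, coefficient count, and the helper name extract_mfcc are assumptions for illustration.

```python
# Hedged example: extracting spectral (frequency-domain) features with librosa.
# Parameter values here are illustrative, not the paper's configuration.
import librosa

def extract_mfcc(path, sr=16000, n_mfcc=40):
    # Load the raw waveform: the temporal (time-domain) representation
    y, sr = librosa.load(path, sr=sr)
    # MFCCs: a standard spectral feature set for speech classification
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Transpose to (frames, n_mfcc) so each row is one frame's feature vector,
    # ready to feed a sequence classifier
    return mfcc.T
```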


