Abstract

Speech emotion recognition (SER) plays a significant role in human–machine interaction. Recognizing emotions from speech and classifying them precisely is challenging because a machine cannot understand the context of an utterance. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotion classification from speech signals; however, they are not efficient enough to accurately depict the emotional state of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). In speaker-dependent experiments, the proposed method achieves accuracies of 95.10% on Emo-DB, 82.10% on SAVEE, 83.80% on IEMOCAP, and 81.30% on RAVDESS. Moreover, for speaker-independent SER, our method yields the best results when compared with existing handcrafted-feature-based SER approaches.
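To make the pipeline above concrete, the following is a minimal sketch of the deep-feature approach: spectrogram images are passed through a pretrained CNN, a simple univariate filter stands in for the correlation-based feature selection step, and an SVM performs the classification. The ResNet-18 backbone, the number of selected features, and all other parameter values are illustrative assumptions rather than choices reported by the paper.

```python
# Minimal sketch of the described pipeline, not the authors' implementation.
# Assumptions: 3-channel spectrogram images of size 224x224, a ResNet-18
# backbone as the pretrained feature extractor, and a univariate filter
# (SelectKBest) as a stand-in for correlation-based feature selection.
import numpy as np
import torch
import torchvision.models as models
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def deep_features(spectrograms: np.ndarray) -> np.ndarray:
    """Embed spectrogram images of shape (N, 3, 224, 224) with a pretrained CNN."""
    backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    backbone.fc = torch.nn.Identity()  # drop the ImageNet classification head
    backbone.eval()
    with torch.no_grad():
        feats = backbone(torch.as_tensor(spectrograms, dtype=torch.float32))
    return feats.numpy()  # (N, 512) deep feature vectors

# Hypothetical usage with spectrogram images X_spec and emotion labels y:
# X = deep_features(X_spec)
# clf = make_pipeline(StandardScaler(),
#                     SelectKBest(f_classif, k=200),  # feature selection stand-in
#                     SVC(kernel="rbf"))
# clf.fit(X, y)
```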

Highlights

  • Speech is a natural and commonly used medium of interaction among human beings. The importance of speech in communication motivates many researchers to develop methods where speech can be used for human–machine interaction

  • We evaluate the performance of these classifiers in terms of accuracy

  • The speech signals are converted into spectrograms, which are computed by applying the fast Fourier transform (FFT) to the emotional speech signals (a minimal sketch of this step is given after this list)
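As an illustration of the spectrogram step mentioned in the last highlight, the sketch below computes a log-magnitude spectrogram with SciPy's short-time FFT. The 16 kHz sampling rate, window length, and overlap are assumed values, not parameters taken from the paper.

```python
# Minimal sketch of converting a speech signal into a log-magnitude spectrogram
# via the short-time FFT. Sampling rate, window length, and overlap are assumed.
import numpy as np
from scipy.signal import spectrogram

def speech_to_spectrogram(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return a (frequency_bins, time_frames) spectrogram in decibels."""
    freqs, times, sxx = spectrogram(signal, fs=sr, nperseg=512, noverlap=384)
    return 10.0 * np.log10(sxx + 1e-10)  # small epsilon avoids log(0) on silence

# Example with one second of a 300 Hz tone, just to show the shapes involved:
# t = np.arange(16000) / 16000.0
# spec = speech_to_spectrogram(np.sin(2 * np.pi * 300 * t))  # -> (257, frames)
```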



Introduction

Speech is a natural and commonly used medium of interaction among human beings. The importance of speech in communication motivates many researchers to develop methods where speech can be used for human–machine interaction. In different parts of the world, people have different cultural backgrounds, local languages, speaking rates, and speaking styles. This cultural variation creates difficulties in effectively recognizing the emotional state of the speaker and makes the process of speech feature selection very challenging and complex. Acoustic features have been used by researchers for speech emotion recognition (SER) [1]. These acoustic features are further divided into four groups: continuous features (energy, pitch, formants, etc.), spectral features, qualitative features (voice quality), and Teager energy operator-based features.
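For illustration only, the sketch below extracts a few representative features from the continuous and spectral groups named above (frame energy, a fundamental-frequency contour, and MFCCs) using librosa, and summarizes each contour into an utterance-level vector. The feature choices and parameter values are assumptions for demonstration, not the configurations used in the cited SER studies.

```python
# Illustrative extraction of a few handcrafted acoustic features with librosa:
# frame energy (continuous), an F0 contour (continuous), and MFCCs (spectral).
# The parameters and the utterance-level mean/std summary are assumptions.
import numpy as np
import librosa

def handcrafted_features(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    energy = librosa.feature.rms(y=y)                   # (1, frames) frame energy
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)       # (frames,) pitch contour
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, frames) spectral shape
    contours = [energy, f0[np.newaxis, :], mfcc]
    # Summarize each contour by its per-coefficient mean and standard deviation.
    return np.concatenate([np.hstack([c.mean(axis=1), c.std(axis=1)]) for c in contours])
```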
