Abstract

Recognition and classification of emotions in speech has become one of the most prominent research topics in human-computer interaction over the last decades. Recognizing the emotions in human conversations can provide deep insight into a person's physical and psychological state. This study proposes a novel hybrid architecture based on acoustic and deep features to increase classification accuracy in speech emotion recognition. The proposed method consists of feature extraction, feature selection, and classification stages. First, acoustic features such as Root Mean Square energy (RMS), Mel-Frequency Cepstral Coefficients (MFCC), and Zero-Crossing Rate (ZCR) are obtained from voice recordings. Subsequently, spectrogram images of the original sound signals are given as input to pre-trained deep network architectures, namely VGG16, ResNet18, ResNet50, ResNet101, SqueezeNet, and DenseNet201, from which deep features are extracted. A hybrid feature vector is then created by combining the acoustic and deep features, and the ReliefF algorithm is used to select the most effective features from this hybrid vector. Finally, a Support Vector Machine (SVM) completes the classification task. Experiments are conducted on three popular datasets from the literature to evaluate the effect of the various techniques: the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Berlin Database of Emotional Speech (EMO-DB), and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP). The method reaches accuracy rates of 79.41%, 90.21%, and 85.37% on the RAVDESS, EMO-DB, and IEMOCAP datasets, respectively. The experimental results clearly show that the proposed technique can accomplish speech emotion recognition efficiently, and that it outperforms comparable methods in terms of classification accuracy.
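To make the front end of the pipeline concrete, the following is a minimal sketch of how the acoustic features (RMS, MFCC, Zero-Crossing Rate) and the spectrogram image fed to the pre-trained CNNs could be computed. It assumes librosa as the analysis library, a 16 kHz sampling rate, 13 MFCCs, 128 mel bands, and mean/std pooling over frames; none of these choices are prescribed by the paper.

```python
# Sketch of the acoustic front end described in the abstract (not the authors' code).
# Assumptions: librosa, 16 kHz audio, 13 MFCCs, default frame/hop sizes,
# and a log-mel spectrogram as the image input to the pre-trained CNNs.
import numpy as np
import librosa

def acoustic_features(path: str, sr: int = 16000) -> np.ndarray:
    """Return a fixed-length acoustic feature vector (RMS, MFCC, ZCR statistics)."""
    y, sr = librosa.load(path, sr=sr)
    rms = librosa.feature.rms(y=y)                      # frame-wise energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 cepstral coefficients
    zcr = librosa.feature.zero_crossing_rate(y)         # frame-wise zero-crossing rate
    frames = np.vstack([rms, mfcc, zcr])                # shape (15, n_frames)
    # Summarize frame-wise features with mean and std to get one vector per utterance.
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

def log_mel_spectrogram(path: str, sr: int = 16000) -> np.ndarray:
    """Log-mel spectrogram used as the image input to the pre-trained CNNs."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)
```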

Highlights

  • Speaking is the most basic means of human interaction; it is both fast and efficient

  • The proposed method combines acoustic features and deep features extracted from pre-trained Convolutional Neural Networks (CNNs) with a Support Vector Machine (SVM) classifier

  • VGG16{fc7}+acoustic features with ReliefF reached an accuracy of 74.41%, ResNet18{fc1000}+acoustic features reached 75.38%, ResNet50{fc1000}+acoustic features with ReliefF reached 78.26%, SqueezeNet{pool10}+acoustic features with ReliefF reached 75.81%, and DenseNet201{conv5_block16}+acoustic features with ReliefF reached 77.46% (see the sketch after this list)
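
As an illustration of how such hybrid configurations could be assembled, the sketch below extracts deep features from (approximately) the fc7 layer of VGG16 applied to a spectrogram image, concatenates them with the acoustic features, applies ReliefF selection, and trains an SVM. The layer names come from the highlights, but the tooling (PyTorch/torchvision, the skrebate implementation of ReliefF, scikit-learn's SVC), the image preprocessing, and the selection parameters here are assumptions, not the authors' settings.

```python
# Sketch of the hybrid pipeline (deep layer features + acoustic features,
# ReliefF selection, SVM). Library choices and parameters are assumptions.
import numpy as np
import torch
import torch.nn as nn
from torchvision import models, transforms
from skrebate import ReliefF          # one available ReliefF implementation
from sklearn.svm import SVC

# VGG16 truncated after the second fully connected layer (~"fc7", 4096-dim output).
vgg = models.vgg16(weights="DEFAULT")
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize((224, 224)),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def deep_features(spec_rgb: np.ndarray) -> np.ndarray:
    """4096-dim VGG16 'fc7' features for one RGB spectrogram image (HxWx3, uint8)."""
    x = preprocess(spec_rgb).unsqueeze(0)        # shape (1, 3, 224, 224)
    with torch.no_grad():
        return vgg(x).squeeze(0).numpy()

def train_hybrid(deep_X, acoustic_X, y, n_selected=1000):
    """Fuse deep + acoustic features, select with ReliefF, train a linear SVM."""
    X = np.hstack([deep_X, acoustic_X]).astype(np.float64)   # hybrid feature matrix
    y = np.asarray(y)
    selector = ReliefF(n_features_to_select=n_selected, n_neighbors=10)
    selector.fit(X, y)
    clf = SVC(kernel="linear").fit(selector.transform(X), y)
    return selector, clf
```

Swapping the backbone (ResNet18/50/101, SqueezeNet, DenseNet201) only changes how the truncated network is built and the dimensionality of the deep feature vector; the fusion, selection, and SVM stages stay the same.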

Summary

INTRODUCTION

Speaking is the most basic means of human interaction, and it is both fast and efficient. During speech, air flows through the trachea from the lungs to the larynx, and this airflow creates speech signals by vibrating the vocal cords [1]. Despite the large number of studies conducted and the advances made in emotion recognition in recent years, it is still unclear which method is the most appropriate. This is largely due to the subjectivity of emotions. Various machine learning techniques are employed to model the relationship between extracted speech features and predetermined emotion labels; among them, SVMs, hidden Markov models (HMMs), and neural networks are the most commonly used.

RELATED WORKS
SPECTROGRAM EXTRACTION
DEEP FEATURE EXTRACTION FROM PRE-TRAINED CNN MODELS
CLASSIFICATION
EXPERIMENTAL RESULTS
Findings
CONCLUSION