Abstract

To address the need for speech emotion classification (SEC) in improving customer quality of service, this paper classifies speech into different emotions by analyzing speech features with a neural-network fusion-based feature extraction mechanism. It proposes a deep learning framework that fuses deep features extracted from spectrograms with prosodic features. A 2D convolutional neural network (CNN) extracts features from the Mel-scale spectrogram, and the drawbacks of traditional methods are addressed by fusing the CNN with a deep neural network (DNN) for deep feature extraction. Experimental results on the RAVDESS emotional speech database demonstrate a significant improvement in emotion classification accuracy when para-lingual spectrogram features and prosodic features are combined using the CNN and DNN. The proposed model is compared, in terms of classification accuracy, with various machine learning and deep learning techniques, including traditional models in which features are extracted using chromagram, spectrogram, Mel-frequency cepstral coefficient (MFCC), and tonal centroid techniques. The proposed model outperforms these baselines when measured by accuracy, precision, recall, and F-score: the classification accuracy of the deep fusion feature extraction approach is 83%, a considerable margin over the traditional state-of-the-art models.

Keywords: Classification, Emotions, Extraction, Feature, Para-lingual, Quality, Recommender, Speech, Machine learning, Validation
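The abstract describes a two-branch fusion architecture: a 2D CNN over the Mel-scale spectrogram and a DNN over prosodic features, with the two feature vectors concatenated before classification. The following is a minimal PyTorch sketch of that kind of architecture, not the paper's exact model; the layer sizes, the 128 mel bins, and the 10-dimensional prosodic vector are illustrative assumptions. Only the 8 output classes come from RAVDESS itself.

import torch
import torch.nn as nn

class FusionSEC(nn.Module):
    """Sketch of a spectrogram-CNN + prosodic-DNN fusion classifier.
    All layer sizes, the 128-bin mel input, and the 10-dim prosodic
    vector are illustrative assumptions, not values from the paper."""

    def __init__(self, n_mels=128, n_prosodic=10, n_emotions=8):
        super().__init__()
        # CNN branch: deep features from the Mel-scale spectrogram
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size map regardless of clip length
            nn.Flatten(),                  # -> 32 * 4 * 4 = 512 features
        )
        # DNN branch: prosodic features (e.g., pitch and energy statistics)
        self.dnn = nn.Sequential(
            nn.Linear(n_prosodic, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        # Fusion head: concatenated deep + prosodic features -> emotion logits
        self.head = nn.Linear(512 + 64, n_emotions)

    def forward(self, spec, prosodic):
        # spec: (batch, 1, n_mels, time); prosodic: (batch, n_prosodic)
        fused = torch.cat([self.cnn(spec), self.dnn(prosodic)], dim=1)
        return self.head(fused)

model = FusionSEC()
logits = model(torch.randn(2, 1, 128, 200), torch.randn(2, 10))
print(logits.shape)  # torch.Size([2, 8]) -- one logit per RAVDESS emotion class

Late fusion of this kind lets each branch learn features suited to its input type, which is the design motivation the abstract gives for combining the CNN and DNN.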
