Abstract

Conventional Speech Emotion Recognition (SER) approaches emphasize magnitude spectrum-based features, such as Mel Frequency Cepstral Coefficients (MFCCs) and the Mel spectrogram, while phase information is ignored due to signal processing difficulties such as the phase wrapping problem. This work develops a multichannel Convolutional Neural Network-Bidirectional Long Short-Term Memory (CNN-BLSTM) architecture with an attention mechanism for speaker-independent SER that considers both phase and magnitude spectrum-based features. The phase-based features are extracted using the Modified Group Delay Function (MODGD) and combined with MFCC features. The CNN-BLSTM network extracts learned representations from the magnitude and phase features; the representations learned from the MFCCs and MODGD are combined and given as input to a Support Vector Machine (SVM) for classification. Deep Canonical Correlation Analysis (DCCA) is used to maximize the correlation between magnitude and phase information, improving the performance of the conventional SER system. The IEMOCAP database is used for performance analysis. The experimental results show improvement over MFCC features and existing approaches for unimodal SER. A real-time web server application was also developed for the proposed architecture.
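For illustration, a minimal sketch of modified group delay computation for one speech frame is given below. It follows the standard MODGD formulation from the group delay literature rather than the authors' exact implementation, and the parameter values (alpha, gamma, and the cepstral liftering length) are assumed for the example, not values reported in the paper.

import numpy as np

def modgd_frame(frame, n_fft=512, alpha=0.4, gamma=0.9, lifter=30):
    """Modified group delay of one windowed frame (illustrative sketch;
    alpha, gamma, and lifter are assumed values, not the paper's)."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)        # spectrum of x(n)
    Y = np.fft.rfft(n * frame, n_fft)    # spectrum of n * x(n)
    # Cepstrally smoothed magnitude spectrum S(w): replaces |X(w)|^2 in the
    # denominator to suppress spikes from zeros close to the unit circle
    cep = np.fft.irfft(np.log(np.abs(X) + 1e-10), n_fft)
    cep[lifter:n_fft - lifter] = 0       # keep low quefrencies (symmetrically)
    S = np.exp(np.fft.rfft(cep).real)
    # Group delay numerator X_R*Y_R + X_I*Y_I, then MODGD compression
    tau = (X.real * Y.real + X.imag * Y.imag) / (S ** (2 * gamma) + 1e-10)
    return np.sign(tau) * np.abs(tau) ** alpha

Frame-level MODGD vectors computed this way can be stacked over time into a phase feature matrix and fed to the phase channel of the network alongside the MFCC channel.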
