Abstract

Emotions are crucial to how humans express perception and carry out daily activities such as communication, learning, and decision-making, yet recognizing human emotion by machine remains a complex task. Recently, deep learning techniques, with their large learning capacity, have been widely used to automate it. Speech emotion recognition (SER), however, is challenging due to language, regional, gender, age, and cultural variations. Most previous SER techniques train deep learning algorithms on only one type of feature representation, which limits performance. This paper presents a novel Parallel Emotion Network (PEmoNet), a Deep Convolutional Neural Network (DCNN) with three parallel arms, for effective SER. The three arms accept the Multitaper Mel Frequency Spectrogram (MTMFS), the Gammatonegram Spectrogram (GS), and the Constant Q-Transform Spectrogram (CQTS) as inputs to improve the distinctiveness of features extracted from the emotion signal. The proposed SER scheme is evaluated on the EMODB and RAVDESS datasets using accuracy, recall, precision, and F1-score, achieving 97.14% and 97.41% accuracy, respectively. These results show that PEmoNet, fed with different spectral representations, improves the distinctiveness of emotions and outperforms existing state-of-the-art methods.
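As a rough illustration of the parallel-arm idea described above, the sketch below builds a three-arm convolutional network in PyTorch. The arm depths, filter counts, embedding size, and concatenation-based fusion are illustrative assumptions, not the paper's published PEmoNet configuration; the input shapes are likewise placeholders. In practice, the mel and CQT spectrograms could be computed with librosa (librosa.feature.melspectrogram, librosa.cqt), while the multitaper and gammatone variants would need a DPSS-taper average and a gammatone filterbank, respectively.

```python
# Minimal sketch of a three-arm parallel CNN for SER, assuming concatenation
# fusion and small illustrative conv stacks (not the paper's exact layers).
import torch
import torch.nn as nn


class ConvArm(nn.Module):
    """One parallel arm: a small conv stack over a single spectrogram type."""

    def __init__(self, out_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # collapse freq/time axes to one vector
        )
        self.proj = nn.Linear(64, out_dim)

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        z = self.features(x).flatten(1)
        return self.proj(z)


class PEmoNetSketch(nn.Module):
    """Three arms (MTMFS, gammatonegram, CQTS) fused by concatenation."""

    def __init__(self, num_classes=7, arm_dim=128):
        super().__init__()
        self.arm_mtmfs = ConvArm(arm_dim)
        self.arm_gamma = ConvArm(arm_dim)
        self.arm_cqts = ConvArm(arm_dim)
        self.classifier = nn.Linear(3 * arm_dim, num_classes)

    def forward(self, mtmfs, gamma, cqts):
        fused = torch.cat(
            [self.arm_mtmfs(mtmfs), self.arm_gamma(gamma), self.arm_cqts(cqts)],
            dim=1,
        )
        return self.classifier(fused)


# Usage with dummy batches of 4 utterances; bin counts are placeholders.
model = PEmoNetSketch(num_classes=7)  # EMODB labels seven emotions
logits = model(
    torch.randn(4, 1, 128, 256),  # MTMFS
    torch.randn(4, 1, 64, 256),   # gammatonegram
    torch.randn(4, 1, 84, 256),   # CQTS
)
print(logits.shape)  # torch.Size([4, 7])
```

Concatenation is the simplest late-fusion choice here; the essential point is that each spectral representation gets its own feature extractor before the fused vector reaches the classifier.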
