Abstract

A practical, high-performance emotion recognition method could facilitate human–computer interaction. Among existing approaches, deep learning techniques have improved the performance of emotion recognition systems. In this work, a new multimodal neural design is presented in which audio and visual data are combined as the input to a hybrid network composed of a bidirectional long short-term memory (BiLSTM) network and two convolutional neural networks (CNNs). The spatial and temporal features extracted from the video frames are fused with the Mel-frequency cepstral coefficient (MFCC) and energy features extracted from the audio signals and with the BiLSTM network outputs. Finally, a Softmax classifier assigns each input to one of the target emotion categories. The proposed model is evaluated on the Surrey Audio–Visual Expressed Emotion (SAVEE), Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS), and Ryerson Multimedia Research Lab (RML) databases. Experimental results demonstrate the effectiveness of the proposed model, which achieves accuracies of 99.75%, 94.99%, and 99.23% on the SAVEE, RAVDESS, and RML databases, respectively. Our experiments show that the proposed method is more effective than existing algorithms for emotion recognition on these datasets.
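
To make the described architecture concrete, the following is a minimal PyTorch sketch of a CNN + BiLSTM audio–visual fusion model of the kind outlined above: a 2D CNN extracts per-frame spatial features, a BiLSTM models their temporal dynamics, a 1D CNN processes the MFCC and energy sequence, and the fused representation feeds a Softmax classifier. All layer sizes, frame counts, and the number of emotion classes are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class AudioVisualEmotionNet(nn.Module):
    """Illustrative CNN + BiLSTM audio-visual fusion sketch (assumed hyperparameters)."""

    def __init__(self, num_classes=7, mfcc_dim=13, audio_feat_dim=64, lstm_hidden=128):
        super().__init__()
        # 2D CNN: spatial features from individual video frames
        self.frame_cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> (batch*frames, 32, 1, 1)
        )
        # BiLSTM over per-frame CNN features: temporal dynamics of the face sequence
        self.bilstm = nn.LSTM(32, lstm_hidden, batch_first=True, bidirectional=True)
        # 1D CNN over the MFCC + energy sequence extracted from the audio signal
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(mfcc_dim + 1, audio_feat_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # -> (batch, audio_feat_dim, 1)
        )
        # Fused audio-visual representation -> emotion classes (Softmax applied in forward)
        self.classifier = nn.Linear(2 * lstm_hidden + audio_feat_dim, num_classes)

    def forward(self, frames, audio_feats):
        # frames: (batch, T, 3, H, W); audio_feats: (batch, mfcc_dim + 1, L)
        b, t = frames.shape[:2]
        x = self.frame_cnn(frames.flatten(0, 1)).view(b, t, -1)
        _, (h, _) = self.bilstm(x)
        visual = torch.cat([h[-2], h[-1]], dim=1)   # final forward + backward states
        audio = self.audio_cnn(audio_feats).squeeze(-1)
        logits = self.classifier(torch.cat([visual, audio], dim=1))
        return logits.softmax(dim=1)                # class probabilities


if __name__ == "__main__":
    model = AudioVisualEmotionNet()
    dummy_frames = torch.randn(2, 8, 3, 64, 64)     # 2 clips, 8 frames each
    dummy_audio = torch.randn(2, 14, 100)           # 13 MFCCs + energy over 100 steps
    print(model(dummy_frames, dummy_audio).shape)   # torch.Size([2, 7])
```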
