Abstract
The rapid development of emotion recognition contributes to realizing highly harmonious human-computer interaction. Considering the complementarity of the emotional information carried by speech and facial expressions, and to overcome the limitation of single-modal emotion recognition based on a single type of emotional feature, this paper proposes a method that combines speech and facial expression features. We used a CNN and an LSTM to learn speech emotion features. Simultaneously, a convolution block with multiple small-scale kernels was designed to extract facial expression features. Finally, a DNN was used to fuse the speech and facial expression features. The multimodal emotion recognition model was evaluated on the IEMOCAP dataset. Compared with the single modalities of speech and facial expression, the overall recognition accuracy of the proposed model increased by 10.05% and 11.27%, respectively.
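The fusion stage described above can be sketched as feature-level (late) fusion: modality-specific feature vectors are concatenated and passed through a small DNN classifier. This is a minimal NumPy illustration, not the paper's implementation; all dimensions (`D_SPEECH`, `D_FACE`, hidden size, and the four emotion classes) and the random weights are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(speech_feat, face_feat, weights):
    """Concatenate per-modality features, then apply a 2-layer DNN.

    speech_feat, face_feat: 1-D feature vectors (e.g. CNN+LSTM output
    for speech, small-kernel CNN output for faces -- shapes assumed).
    """
    x = np.concatenate([speech_feat, face_feat], axis=-1)
    (W1, b1), (W2, b2) = weights
    h = relu(x @ W1 + b1)          # hidden fusion layer
    return softmax(h @ W2 + b2)    # class probabilities

# Hypothetical dimensions: 128-d speech, 256-d face, 4 emotion classes
D_SPEECH, D_FACE, HIDDEN, N_CLASSES = 128, 256, 64, 4
weights = [
    (rng.standard_normal((D_SPEECH + D_FACE, HIDDEN)) * 0.01,
     np.zeros(HIDDEN)),
    (rng.standard_normal((HIDDEN, N_CLASSES)) * 0.01,
     np.zeros(N_CLASSES)),
]
speech = rng.standard_normal(D_SPEECH)
face = rng.standard_normal(D_FACE)
probs = fuse_and_classify(speech, face, weights)
```

In practice the fusion weights would be trained jointly with (or on top of) the two feature extractors; the sketch only shows the data flow at inference time.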