Abstract

The rapid development of emotion recognition contributes to more natural and harmonious human-computer interaction. Taking into account the complementarity of the emotional information carried by speech and facial expressions, and moving beyond the limitations of single-modal emotion recognition based on a single feature type, this paper proposes a method that combines speech and facial expression features. We use a CNN and an LSTM to learn speech emotion features. Simultaneously, multiple small-scale kernel convolution blocks are designed to extract facial expression features. Finally, a DNN is used to fuse the speech and facial expression features. The multimodal emotion recognition model was evaluated on the IEMOCAP dataset. Compared with the speech-only and facial-expression-only models, the overall recognition accuracy of the proposed model increased by 10.05% and 11.27%, respectively.
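To make the described pipeline concrete, the following is a minimal PyTorch sketch of a CNN+LSTM speech branch, a small-kernel convolutional face branch, and a DNN fusion head. The abstract does not specify layer sizes, input shapes, or the number of emotion classes, so every dimension here (40 mel bands, 48x48 face crops, 128-dimensional branch outputs, four classes) is an illustrative assumption rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class SpeechBranch(nn.Module):
    """CNN + LSTM over a spectrogram. Assumed input shape: [B, 1, n_mels, T]."""
    def __init__(self, n_mels=40, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # After two 2x2 poolings the frequency axis is n_mels // 4.
        self.lstm = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=hidden,
                            batch_first=True)

    def forward(self, x):
        f = self.conv(x)                      # [B, 64, n_mels/4, T/4]
        f = f.permute(0, 3, 1, 2).flatten(2)  # [B, T/4, 64 * n_mels/4]
        _, (h, _) = self.lstm(f)
        return h[-1]                          # final hidden state: [B, hidden]

class FaceBranch(nn.Module):
    """Stacked small-kernel (3x3) convolution blocks. Assumed input: [B, 1, 48, 48]."""
    def __init__(self, out_dim=128):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2))
        self.blocks = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.fc = nn.Linear(128 * 6 * 6, out_dim)  # 48 -> 24 -> 12 -> 6 after pooling

    def forward(self, x):
        return self.fc(self.blocks(x).flatten(1))

class FusionModel(nn.Module):
    """Concatenate the two feature vectors and classify with a small DNN."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.speech, self.face = SpeechBranch(), FaceBranch()
        self.dnn = nn.Sequential(
            nn.Linear(128 + 128, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, n_classes))

    def forward(self, spec, img):
        return self.dnn(torch.cat([self.speech(spec), self.face(img)], dim=1))

# Example forward pass with dummy tensors (batch of 2).
model = FusionModel()
logits = model(torch.randn(2, 1, 40, 100), torch.randn(2, 1, 48, 48))
print(logits.shape)  # torch.Size([2, 4])
```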
