Abstract
The rapid development of emotion recognition contributes to realizing highly harmonious human-computer interaction. Considering the complementarity of the emotional information carried by speech and facial expressions, and to overcome the limitation of single-modal emotion recognition based on a single type of emotional feature, this paper proposes a method that combines speech and facial expression features. We used a CNN and an LSTM to learn speech emotion features. Simultaneously, a convolution block with multiple small-scale kernels was designed to extract facial expression features. Finally, a DNN was used to fuse the speech and facial expression features. The multimodal emotion recognition model was evaluated on the IEMOCAP dataset. Compared with the single modalities of speech and facial expression, the overall recognition accuracy of the proposed model increased by 10.05% and 11.27%, respectively.
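The fusion stage described above can be sketched as feature-level (late) fusion: modality-specific feature vectors are concatenated and passed through a small DNN classifier. This is a minimal NumPy illustration, not the paper's implementation; all dimensions (`D_SPEECH`, `D_FACE`, hidden size, and the four emotion classes) and the random weights are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_classify(speech_feat, face_feat, weights):
    """Concatenate per-modality features, then apply a 2-layer DNN.

    speech_feat, face_feat: 1-D feature vectors (e.g. CNN+LSTM output
    for speech, small-kernel CNN output for faces -- shapes assumed).
    """
    x = np.concatenate([speech_feat, face_feat], axis=-1)
    (W1, b1), (W2, b2) = weights
    h = relu(x @ W1 + b1)          # hidden fusion layer
    return softmax(h @ W2 + b2)    # class probabilities

# Hypothetical dimensions: 128-d speech, 256-d face, 4 emotion classes
D_SPEECH, D_FACE, HIDDEN, N_CLASSES = 128, 256, 64, 4
weights = [
    (rng.standard_normal((D_SPEECH + D_FACE, HIDDEN)) * 0.01,
     np.zeros(HIDDEN)),
    (rng.standard_normal((HIDDEN, N_CLASSES)) * 0.01,
     np.zeros(N_CLASSES)),
]
speech = rng.standard_normal(D_SPEECH)
face = rng.standard_normal(D_FACE)
probs = fuse_and_classify(speech, face, weights)
```

In practice the fusion weights would be trained jointly with (or on top of) the two feature extractors; the sketch only shows the data flow at inference time.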