Abstract

With the development of artificial intelligence, directly fusing the features of each modality with neural networks has become a mainstream approach to speech-visual emotion recognition. However, this approach struggles to capture both the shared and the modality-specific features of the speech and visual modalities, which seriously degrades recognition performance. To this end, this paper proposes an emotion recognition method that fuses the shared and specific features of the speech-visual modalities. In particular, three-dimensional convolutional neural networks (3D-CNNs) and siamese networks are used as the feature extraction backbones for the speech and visual modalities, and the loss function is specially designed so that the proposed method effectively obtains both the shared and the specific features of the two modalities. Experimental results on the RML, BAUM-1s, and eNTERFACE05 datasets show that the proposed method achieves better recognition performance.
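The abstract does not give the exact network or loss definitions, but the core idea of projecting each modality into a shared subspace and a modality-specific subspace can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the paper's actual design: the layer sizes, the tied shared projection, the MSE similarity term pulling shared features together, the orthogonality term separating shared from specific features, and the weights `alpha` and `beta`.

```python
# A minimal sketch (not the authors' released code) of shared/specific
# feature learning for speech-visual emotion recognition.
# All dimensions and loss weights below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Backbone3D(nn.Module):
    """Tiny 3D-CNN mapping a (B, C, T, H, W) clip to a feature vector."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

class SharedSpecificModel(nn.Module):
    """Siamese-style projections into shared and modality-specific subspaces."""
    def __init__(self, feat_dim=128, num_classes=6):
        super().__init__()
        self.audio_backbone = Backbone3D(in_channels=1, feat_dim=feat_dim)   # e.g. spectrogram clips
        self.visual_backbone = Backbone3D(in_channels=3, feat_dim=feat_dim)  # RGB face clips
        self.shared_proj = nn.Linear(feat_dim, feat_dim)   # one projection shared by both modalities
        self.audio_spec = nn.Linear(feat_dim, feat_dim)    # audio-specific projection
        self.visual_spec = nn.Linear(feat_dim, feat_dim)   # visual-specific projection
        self.classifier = nn.Linear(4 * feat_dim, num_classes)

    def forward(self, audio, visual):
        fa, fv = self.audio_backbone(audio), self.visual_backbone(visual)
        sa, sv = self.shared_proj(fa), self.shared_proj(fv)   # shared features
        pa, pv = self.audio_spec(fa), self.visual_spec(fv)    # specific features
        logits = self.classifier(torch.cat([sa, sv, pa, pv], dim=1))
        return logits, (sa, sv, pa, pv)

def total_loss(logits, labels, sa, sv, pa, pv, alpha=0.1, beta=0.1):
    """Illustrative composite loss: classification + shared similarity + orthogonality."""
    cls = F.cross_entropy(logits, labels)
    sim = F.mse_loss(sa, sv)                       # pull shared features of the two modalities together
    def ortho(a, b):                               # push shared and specific features apart
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        return (a * b).sum(dim=1).pow(2).mean()
    orth = ortho(sa, pa) + ortho(sv, pv)
    return cls + alpha * sim + beta * orth
```

The composite loss follows a common pattern for shared/specific decomposition: a similarity term encourages the shared projections of both modalities to agree, while an orthogonality penalty keeps each modality's specific features from duplicating the shared ones; the actual terms and weights used in the paper may differ.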
