Abstract

With the development of artificial intelligence, directly fusing the features of each modality with neural networks has become a mainstream approach to speech-visual emotion recognition. However, this approach struggles to capture both the shared and the modality-specific features of the speech and visual modalities, which seriously degrades recognition performance. To this end, this paper proposes an emotion recognition method that fuses the shared and specific features of the speech-visual modalities. In particular, three-dimensional convolutional neural networks (3D-CNNs) and siamese networks are used as the feature extraction backbones for the speech and visual modalities, and the loss function is specially designed so that the proposed method effectively obtains both the shared and the specific features of the two modalities. Experimental results on the RML, BAUM-1s, and eNTERFACE05 datasets show that the proposed method achieves better recognition performance.
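The abstract does not give the exact network or loss definitions, but the core idea of projecting each modality into a shared subspace and a modality-specific subspace can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the paper's actual design: the layer sizes, the tied shared projection, the MSE similarity term pulling shared features together, the orthogonality term separating shared from specific features, and the weights `alpha` and `beta`.

```python
# A minimal sketch (not the authors' released code) of shared/specific
# feature learning for speech-visual emotion recognition.
# All dimensions and loss weights below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Backbone3D(nn.Module):
    """Tiny 3D-CNN mapping a (B, C, T, H, W) clip to a feature vector."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):
        return self.fc(self.net(x).flatten(1))

class SharedSpecificModel(nn.Module):
    """Siamese-style projections into shared and modality-specific subspaces."""
    def __init__(self, feat_dim=128, num_classes=6):
        super().__init__()
        self.audio_backbone = Backbone3D(in_channels=1, feat_dim=feat_dim)   # e.g. spectrogram clips
        self.visual_backbone = Backbone3D(in_channels=3, feat_dim=feat_dim)  # RGB face clips
        self.shared_proj = nn.Linear(feat_dim, feat_dim)   # one projection shared by both modalities
        self.audio_spec = nn.Linear(feat_dim, feat_dim)    # audio-specific projection
        self.visual_spec = nn.Linear(feat_dim, feat_dim)   # visual-specific projection
        self.classifier = nn.Linear(4 * feat_dim, num_classes)

    def forward(self, audio, visual):
        fa, fv = self.audio_backbone(audio), self.visual_backbone(visual)
        sa, sv = self.shared_proj(fa), self.shared_proj(fv)   # shared features
        pa, pv = self.audio_spec(fa), self.visual_spec(fv)    # specific features
        logits = self.classifier(torch.cat([sa, sv, pa, pv], dim=1))
        return logits, (sa, sv, pa, pv)

def total_loss(logits, labels, sa, sv, pa, pv, alpha=0.1, beta=0.1):
    """Illustrative composite loss: classification + shared similarity + orthogonality."""
    cls = F.cross_entropy(logits, labels)
    sim = F.mse_loss(sa, sv)                       # pull shared features of the two modalities together
    def ortho(a, b):                               # push shared and specific features apart
        a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
        return (a * b).sum(dim=1).pow(2).mean()
    orth = ortho(sa, pa) + ortho(sv, pv)
    return cls + alpha * sim + beta * orth
```

The composite loss follows a common pattern for shared/specific decomposition: a similarity term encourages the shared projections of both modalities to agree, while an orthogonality penalty keeps each modality's specific features from duplicating the shared ones; the actual terms and weights used in the paper may differ.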
