Abstract

Affective dimension prediction from multi-modal data is becoming an increasingly attractive research field in artificial intelligence (AI) and human-computer interaction (HCI). Previous works have shown that discriminative features from multiple modalities are important for accurately recognizing emotional states. Recently, deep representations have proved effective for emotional state recognition. To investigate new deep spatial-temporal features and evaluate their effectiveness for affective dimension recognition, in this paper we propose: (1) combining a pre-trained 2D-CNN and a 1D-CNN to learn deep spatial-temporal features from video images and audio spectrograms; and (2) a Spatial-Temporal Graph Convolutional Network (ST-GCN) adapted to the facial landmark graph. To evaluate the effectiveness of the proposed spatial-temporal features for affective dimension prediction, we propose a Deep Bidirectional Long Short-Term Memory (DBLSTM) model for single-modality prediction as well as early-fusion and late-fusion prediction. For the liking dimension, we use the text modality. Experimental results on the AVEC2019 CES dataset show that the proposed spatial-temporal features and recognition model achieve promising results. On the development set, the concordance correlation coefficient (CCC) reaches $0.724$ for arousal and $0.705$ for valence; on the test set, the CCC is $0.513$ for arousal and $0.515$ for valence, outperforming the baseline system, whose corresponding CCCs are $0.355$ and $0.468$ for arousal and valence, respectively.
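The abstract does not give implementation details of the DBLSTM regressor, so the following PyTorch sketch is only a minimal illustration under stated assumptions: it takes a sequence of per-frame feature vectors (e.g. 2D-CNN, 1D-CNN, or ST-GCN embeddings) and emits one affective value (arousal or valence) per time step. The class name `DBLSTMRegressor` and all layer sizes are hypothetical, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DBLSTMRegressor(nn.Module):
    """Sketch of a deep bidirectional LSTM regressor for per-frame
    affective dimension prediction (sizes are illustrative only)."""

    def __init__(self, feature_dim, hidden_dim=128, num_layers=2):
        super().__init__()
        # Stacked bidirectional LSTM over the feature sequence
        self.blstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
        )
        # Linear head mapping each time step to a single affective value
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, feature_dim)
        out, _ = self.blstm(x)             # (batch, time, 2 * hidden_dim)
        return self.head(out).squeeze(-1)  # (batch, time)
```

The same sequence model could in principle be applied to a single modality or to early-fused (concatenated) features; late fusion would instead combine the per-modality predictions.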
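For reference, the concordance correlation coefficient reported above combines Pearson correlation with agreement of means and variances, $\mathrm{CCC} = \frac{2\,\mathrm{cov}(x,y)}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}$. A minimal NumPy implementation of this standard definition (not code from the paper) is:

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    mean_true, mean_pred = y_true.mean(), y_pred.mean()
    var_true, var_pred = y_true.var(), y_pred.var()
    covariance = np.mean((y_true - mean_true) * (y_pred - mean_pred))
    return 2 * covariance / (var_true + var_pred + (mean_true - mean_pred) ** 2)
```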
