Abstract
Affective dimension prediction from multi-modal data is becoming an increasingly attractive research field in artificial intelligence (AI) and human-computer interaction (HCI). Previous work has shown that discriminative features from multiple modalities are important for accurately recognizing emotional states. Recently, deep representations have proved effective for emotional state recognition. To investigate new deep spatial-temporal features and evaluate their effectiveness for affective dimension recognition, in this paper we propose:~(1) combining a pre-trained 2D-CNN and a 1D-CNN to learn deep spatial-temporal features from video images and audio spectrograms; and~(2) a Spatial-Temporal Graph Convolutional Network (ST-GCN) adapted to facial landmark graphs. To evaluate the effectiveness of the proposed spatial-temporal features for affective dimension prediction, we propose a Deep Bidirectional Long Short-Term Memory (DBLSTM) model for single-modality, early-fusion, and late-fusion predictions. For the liking dimension, we use the text modality. Experimental results on the AVEC 2019 CES dataset show that our proposed spatial-temporal features and recognition model obtain promising results. On the development set, the concordance correlation coefficient (CCC) reaches $0.724$ for arousal and $0.705$ for valence; on the test set, the CCC is $0.513$ for arousal and $0.515$ for valence, outperforming the baseline system, whose corresponding CCCs are $0.355$ and $0.468$ for arousal and valence, respectively.
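The concordance correlation coefficient used to report the results above is a standard metric in the AVEC challenges; it penalizes both low correlation and systematic bias between predictions and gold annotations. A minimal sketch of its computation (the function name `concordance_cc` is our own, not from the paper):

```python
import numpy as np

def concordance_cc(y_true, y_pred):
    """Concordance correlation coefficient (CCC):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2).
    Equals 1 for perfect agreement, 0 for no concordance."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    # Biased (population) covariance, matching np.var's default normalization
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)
```

Unlike Pearson correlation, CCC drops below 1 when predictions are merely shifted or scaled versions of the targets, which is why it is preferred for continuous arousal/valence evaluation.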