Abstract

In this paper, we propose a deep neural network - switching Kalman filter (DNN-SKF) based framework for both single-modal and multi-modal continuous affective dimension estimation. The DNN-SKF framework first models the complex nonlinear relationship between the input (audio, visual, or lexical) features and the affective dimensions via a non-recurrent DNN, and then models the temporal dynamics embedded in the emotions via a segmental linear SKF. Affective dimension estimation experiments are carried out on the Audio Visual Emotion Challenge (AVEC2012) database. Single-modal estimation results are compared with those from Support Vector Regression (SVR) models and Bidirectional Long Short-Term Memory Recurrent Neural Network (BLSTM-RNN) models: for all modalities, and for all affective dimensions except arousal estimated from the audio features, the DNN-SKFs outperform the SVR and BLSTM-RNN models. Multi-modal estimation results are compared with the state-of-the-art results from the AVEC2012 competition. On both the development set and the test set, the proposed DNN-SKF models obtain the best performance in estimating the affective dimensions. On the test set, with the audio-visual features, the average Pearson correlation coefficient (COR) is improved to 0.326 from the 0.226 of the linear regression method [1], while with the audio-visual and lexical features, the COR is improved to 0.355 from the 0.344 of the particle filter fusion method (SVR-PF) [2].
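The two-stage idea described above can be sketched in miniature. The toy below is not the paper's actual system: it replaces the DNN with simulated noisy frame-wise predictions, and reduces the switching Kalman filter to a single-regime scalar Kalman filter with a random-walk state model; all trajectory shapes and noise variances are assumed for illustration only. It shows why temporally smoothing frame-wise regressor outputs can reduce estimation error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ground-truth affect trajectory (e.g., valence over time) and
# noisy frame-wise estimates standing in for the DNN's per-frame outputs.
T = 200
truth = np.sin(np.linspace(0.0, 3.0 * np.pi, T))
dnn_out = truth + rng.normal(0.0, 0.4, T)  # stand-in for DNN predictions

# Scalar Kalman filter, one regime of a switching model as a simplification:
# state model  x_t = x_{t-1} + w_t,  observation  z_t = x_t + v_t.
q, r = 0.01, 0.16  # process / observation noise variances (assumed values)
x, p = dnn_out[0], 1.0
smoothed = np.empty(T)
for t, z in enumerate(dnn_out):
    p += q                # predict step: state uncertainty grows
    k = p / (p + r)       # Kalman gain
    x += k * (z - x)      # update: fuse DNN output as the observation
    p *= 1.0 - k
    smoothed[t] = x

# Temporal filtering should reduce error versus the raw frame-wise estimates.
err_raw = float(np.mean((dnn_out - truth) ** 2))
err_smooth = float(np.mean((smoothed - truth) ** 2))
print(err_smooth < err_raw)
```

In the full DNN-SKF, the filter additionally switches among several linear dynamic regimes per segment, which this single-regime sketch omits.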
