Affective Video Content Analysis via Multimodal Deep Quality Embedding Network

Yaochen Zhu, Zhenzhong Chen, Feng Wu

doi:10.1109/taffc.2020.3004114

Abstract

The establishment of large datasets for affective video content analysis, such as LIRIS-ACCEDE, opens up the possibility of using the representation power of deep neural networks (DNNs) to model the complex process by which videos elicit affective responses from viewers. However, label noise in these datasets poses a considerable challenge to both the training and the evaluation of DNNs. DNNs are optimized with stochastic gradient descent (SGD), and label noise in the training set leads to an inaccurate estimate of the gradient, which may cause the model to converge to a non-optimal solution. In addition, label noise in the test set renders the results of model evaluation untrustworthy. In this article, we propose a multimodal deep quality embedding network (MMDQEN) for affective video content analysis. Specifically, MMDQEN infers the latent label and the label quality from noisy training samples so that cleaner supervision signals are provided to the DNN-based affective classifier. A tractable objective for MMDQEN is derived using variational inference and a conditional independence assumption. In addition, to avoid the evaluation bias incurred by annotation noise in the test set, we establish new test sets based on the original LIRIS-ACCEDE database, which we name LIRIS-ACCEDE-RANK, in which the samples are ranked according to their label uncertainty level, and we introduce corresponding evaluation metrics to further reveal the performance of different models. Experiments conducted on both the LIRIS-ACCEDE and the LIRIS-ACCEDE-RANK datasets demonstrate the effectiveness of the proposed method.

Full Text