Abstract

In recent years, multimedia event detection has attracted extensive research attention because of the exponential growth of web video data. Traditional approaches usually rely on a single visual representation, which may lack sufficient descriptive power. How to jointly employ multiple types of visual representation to facilitate multimedia event detection (MED) in videos remains an open problem. In this work, we propose a novel event detection system based on the combination of multi-view representations and a co-training algorithm. Specifically, given several types of low-level visual features (i.e., Convolutional Neural Network (CNN) features and Fisher vectors), we first train an initial classifier for each type of visual feature. Then, we use these classifiers to separately predict the labels of unlabeled videos, and videos with consistent predictions are merged into the training set. We alternately repeat the processes of training the classifiers and enlarging the training set until convergence. To exploit the complementarity between the different types of visual features, the prediction scores of the two classifiers are fused by a linear weighted fusion method. We evaluate our MED system on the TRECVID MED11 data set, and the experimental results demonstrate that the proposed approach outperforms several state-of-the-art algorithms.
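To make the described pipeline concrete, the sketch below illustrates the co-training loop and the linear weighted score fusion under stated assumptions: logistic-regression classifiers stand in for the paper's per-feature classifiers, and the function names (`co_train`, `fused_scores`), the agreement threshold, and the fusion weight `w` are hypothetical choices for illustration, not values from the paper.

```python
# A minimal co-training sketch for two feature views (e.g., CNN
# descriptors and Fisher vectors), each stored as a NumPy array.
# Assumptions: the classifier type, the confidence threshold, and the
# fusion weight are illustrative, not the paper's exact settings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             threshold=0.9, max_rounds=10):
    """Alternately train one classifier per view and grow the labeled
    set with unlabeled videos on which both views agree confidently."""
    for _ in range(max_rounds):
        clf1 = LogisticRegression(max_iter=1000).fit(X1_lab, y_lab)
        clf2 = LogisticRegression(max_iter=1000).fit(X2_lab, y_lab)
        if len(X1_unlab) == 0:
            break
        p1 = clf1.predict_proba(X1_unlab)
        p2 = clf2.predict_proba(X2_unlab)
        y1, y2 = p1.argmax(axis=1), p2.argmax(axis=1)
        c1, c2 = p1.max(axis=1), p2.max(axis=1)
        # Keep only samples where both views predict the same label
        # with high confidence ("consistent prediction").
        agree = (y1 == y2) & (c1 >= threshold) & (c2 >= threshold)
        if not agree.any():
            break  # converged: no confidently agreed samples remain
        X1_lab = np.vstack([X1_lab, X1_unlab[agree]])
        X2_lab = np.vstack([X2_lab, X2_unlab[agree]])
        y_lab = np.concatenate([y_lab, y1[agree]])
        X1_unlab, X2_unlab = X1_unlab[~agree], X2_unlab[~agree]
    return clf1, clf2

def fused_scores(clf1, clf2, X1, X2, w=0.5):
    """Linear weighted fusion of the two views' prediction scores."""
    return w * clf1.predict_proba(X1) + (1.0 - w) * clf2.predict_proba(X2)
```

Merging only the consistently predicted videos into both views' training sets is what lets each classifier benefit from the other view's confident predictions, which is the core idea of co-training as described in the abstract.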
