Abstract

As the amount of available video data has grown substantially, automatic video classification has become an urgent yet challenging task. Most video classification methods, especially deep learning methods, focus on acquiring discriminative spatial visual features and motion patterns for video representation and have achieved strong results on action recognition problems. However, the performance of most of these methods degrades drastically on more generic video classification tasks, where the video contents are much more complex. Thus, in this paper, the mid-level semantics of videos are exploited to bridge the semantic gap between low-level features and high-level video semantics. Inspired by ``term frequency-inverse document frequency'' (TF-IDF), a word weighting method from text classification, the visual objects in videos are regarded as the words in texts, and two new weighting methods are proposed to encode videos by weighting visual objects according to the characteristics of videos. In addition, the semantic similarities between video categories and visual objects are introduced from the text domain as privileged information to facilitate classifier training on the obtained semantic representations of videos. The proposed semantic encoding method (semantic stream) is then fused with the popular two-stream CNN model to produce the final classification results. Experiments are conducted on two large-scale complex video datasets, CCV and ActivityNet, and the results validate the effectiveness of the proposed methods.
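To make the text-to-video analogy concrete, the sketch below applies standard TF-IDF weighting when detected visual objects are treated as words and videos as documents. It is a minimal illustration of the underlying idea only, not the paper's two proposed weighting schemes (whose exact definitions are not given in this abstract); the detection format and all names are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_encode(video_objects, num_videos, doc_freq):
    """Encode one video as a TF-IDF-weighted bag of visual objects.

    video_objects: list of object labels detected across the video's frames,
                   e.g. ["dog", "person", "dog", "ball"] (assumed format).
    num_videos:    total number of videos in the training corpus (N).
    doc_freq:      dict mapping an object label to the number of videos
                   in which it appears at least once.
    """
    counts = Counter(video_objects)
    total = sum(counts.values())
    encoding = {}
    for obj, cnt in counts.items():
        tf = cnt / total  # term frequency within this video
        # Smoothed inverse document frequency (sklearn-style variant),
        # kept strictly positive so frequent objects are not zeroed out.
        idf = math.log((1 + num_videos) / (1 + doc_freq[obj])) + 1
        encoding[obj] = tf * idf
    return encoding

# Toy usage: build document frequencies over a small corpus of detections.
corpus = [["dog", "person", "dog"], ["cat", "person"], ["dog", "ball"]]
doc_freq = Counter(obj for video in corpus for obj in set(video))
print(tfidf_encode(corpus[0], len(corpus), doc_freq))
```

In this analogy, an object that appears in almost every video (like ``person'') receives a low weight, while an object characteristic of few videos receives a high weight, mirroring how TF-IDF downweights common words in text classification.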
