Abstract

Video object detection aims to accurately localize objects in videos and correctly recognize their categories. Existing video object detection methods have made progress in recent years, but they still suffer from inaccurate object localization, incorrect object recognition, or insufficient relation learning, resulting in limited detection performance. In this paper, we propose a novel triple-cooperative network (TCNet) for high-performance video object detection, with three substantial improvements that address these problems. First, we develop a context-aware proposal refinement module that generates high-quality proposals, enabling our TCNet to localize objects more accurately. Second, we present a similarity-aware semantic distillation module that leverages the semantic knowledge of class labels as additional supervisory signals to enhance the object recognition ability of our TCNet. Third, we design a structure-aware relation learning module that models the structural relations between features with an adaptive-pruning residual graph convolutional network, enabling our TCNet to perform more effective feature aggregation. Extensive experiments on the challenging ImageNet VID dataset demonstrate that our TCNet outperforms current state-of-the-art methods. More remarkably, our TCNet achieves 85.2% mAP and 86.3% mAP with ResNet-101 and ResNeXt-101 backbones, respectively.
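The abstract gives no implementation details, but as a rough illustration of the structure-aware relation learning idea, the sketch below performs one residual graph-convolution step over proposal features in PyTorch. Everything in it is our assumption rather than TCNet's actual design: the class name ResidualGCNLayer, the cosine-similarity affinity, and the threshold-based edge pruning standing in for "adaptive pruning" are all hypothetical.

# Illustrative sketch only: a generic residual GCN aggregation step over
# proposal features, with similarity-thresholded edges as a stand-in for
# the paper's adaptive pruning (our assumption, not TCNet's stated method).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualGCNLayer(nn.Module):
    """One residual graph-convolution step over a set of proposal features."""
    def __init__(self, dim: int, prune_threshold: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.prune_threshold = prune_threshold

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (N, dim) features of N proposals from the same video clip.
        normed = F.normalize(feats, dim=1)
        sim = normed @ normed.T                                   # cosine affinity graph
        adj = torch.where(sim > self.prune_threshold,
                          sim, torch.zeros_like(sim))             # prune weak edges
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)  # row-normalize
        aggregated = adj @ self.proj(feats)                       # graph convolution
        return feats + F.relu(aggregated)                         # residual connection

# Usage: aggregate 300 proposal features of dimension 1024.
layer = ResidualGCNLayer(dim=1024)
out = layer(torch.randn(300, 1024))  # shape (300, 1024)

Row-normalizing the pruned affinity matrix keeps the aggregated features on the same scale as the inputs, so the residual connection remains stable when such layers are stacked.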
