Abstract

Video object detection (VOD) has attracted growing attention in recent years due to challenges such as occlusion and motion blur. Feature aggregation from local or global support frames has proven effective for addressing these challenges. To achieve better feature aggregation, in this paper we propose two improvements over previous works: a class-constrained spatial-temporal relation network and a correlation-based feature alignment module. The class-constrained spatial-temporal relation network operates on object region proposals and learns two kinds of relations: (1) dependencies among region proposals of the same object class from support frames sampled over a long time range, or even the whole sequence, and (2) spatial relations among proposals of different objects in the target frame. The homogeneity constraint in the spatial-temporal relation network not only filters out many defective proposals but also implicitly embeds traditional post-processing strategies (e.g., Seq-NMS), yielding a unified, end-to-end trainable network. In the feature alignment module, we propose a correlation-based method to align support and target frames for feature aggregation in the temporal domain. Our experiments show that the proposed method significantly improves the accuracy of single-frame detectors and outperforms previous temporal or spatial relation networks. Without bells or whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset (84.80% with ResNet-101) without any post-processing.
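The abstract does not give the exact form of the correlation-based alignment, but the general idea of aligning support-frame features to the target frame via local correlation can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the function name `correlation_align`, the dot-product similarity, the softmax weighting, and the search `radius` are all assumptions for illustration.

```python
import numpy as np

def correlation_align(target, support, radius=1):
    """Hypothetical sketch: align support-frame features to the target frame.

    For each target location, correlate its feature vector with support
    features in a (2*radius+1)^2 neighborhood, then take the
    softmax-weighted average of those support features as the aligned
    feature. `target` and `support` are arrays of shape (C, H, W).
    """
    C, H, W = target.shape
    aligned = np.zeros_like(target)
    for y in range(H):
        for x in range(W):
            t = target[:, y, x]
            scores, feats = [], []
            # Search a local window around (y, x) in the support frame.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < H and 0 <= xx < W:
                        s = support[:, yy, xx]
                        scores.append(t @ s / np.sqrt(C))  # scaled correlation
                        feats.append(s)
            scores = np.array(scores)
            w = np.exp(scores - scores.max())  # softmax over window positions
            w /= w.sum()
            aligned[:, y, x] = (np.stack(feats, axis=1) * w).sum(axis=1)
    return aligned
```

Because each aligned vector is a convex combination of support features, the output stays within the range of the support features; in a full detector this operation would be applied per feature-map level before temporal aggregation.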
