Abstract

Video object detection is a fundamental task in computer vision. A mainstay solution for this task is to aggregate features from different frames to enhance detection on the current frame. Off-the-shelf feature aggregation paradigms for video object detection typically rely on inferring feature-to-feature (Fea2Fea) relations. However, most existing methods are unable to stably estimate Fea2Fea relations due to appearance deterioration caused by object occlusion, motion blur, or rare poses, resulting in limited detection performance. In this paper, we study Fea2Fea relations from a new perspective and propose a novel dual-level graph relation network (DGRNet) for high-performance video object detection. Different from previous methods, our DGRNet leverages a residual graph convolutional network to simultaneously model Fea2Fea relations at two levels, the frame level and the proposal level, which facilitates better feature aggregation in the temporal domain. To prune unreliable edge connections in the graph, we introduce a node topology affinity measure that adaptively evolves the graph structure by mining the local topological information of pairwise nodes. To the best of our knowledge, our DGRNet is the first video object detection method that leverages dual-level graph relations to guide feature aggregation. We conduct experiments on the ImageNet VID dataset, and the results demonstrate the superiority of our DGRNet against state-of-the-art methods. In particular, our DGRNet achieves 85.0% mAP and 86.2% mAP with ResNet-101 and ResNeXt-101, respectively.

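To make the feature-aggregation idea concrete, the sketch below shows a minimal residual graph convolution over proposal features, with weak edges pruned by keeping only each node's strongest affinities. This is an illustrative approximation, not the paper's implementation: the names `ResidualGraphConv`, `build_pruned_graph`, and `keep_ratio` are hypothetical, and the simple top-k cosine-affinity pruning stands in for the paper's node topology affinity measure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualGraphConv(nn.Module):
    """One residual graph convolution layer: X' = ReLU(X + A_norm @ X @ W)."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # Row-normalize the adjacency so aggregated features stay on scale.
        adj_norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        out = self.proj(adj_norm @ x)
        return F.relu(x + out)  # residual connection preserves the input feature


def build_pruned_graph(features, keep_ratio=0.5):
    """Build a proposal-level graph from cosine affinities and prune weak edges.

    Each node keeps only its top-k strongest edges; this is a simple stand-in
    for the node topology affinity measure described in the paper.
    """
    feats = F.normalize(features, dim=-1)
    affinity = feats @ feats.t()                      # pairwise cosine similarity
    k = max(1, int(keep_ratio * features.size(0)))
    topk = affinity.topk(k, dim=-1)
    adj = torch.zeros_like(affinity).scatter_(-1, topk.indices, topk.values)
    return adj


if __name__ == "__main__":
    # 32 proposal features of dimension 256, e.g. pooled from several frames.
    proposals = torch.randn(32, 256)
    adj = build_pruned_graph(proposals, keep_ratio=0.25)
    layer = ResidualGraphConv(256)
    enhanced = layer(proposals, adj)
    print(enhanced.shape)  # torch.Size([32, 256])
```

In this sketch the same layer could be applied at the frame level by treating per-frame feature vectors as nodes; the dual-level design in DGRNet combines both kinds of relations to guide temporal aggregation.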