Abstract

Weakly supervised object detection (WSOD) has become an effective paradigm that requires only image-level class labels to train object detectors. However, WSOD detectors tend to learn highly discriminative features that correspond to local object parts rather than complete objects, which results in imprecise localization. Designing backbones specifically for WSOD is one feasible way to address this issue. However, a redesigned backbone generally has to be pretrained on the large-scale ImageNet dataset or trained from scratch, and both options require far more time and computation than fine-tuning. In this article, we explore how to optimize the backbone while keeping the original pretrained model usable. Because the pooling layer summarizes neighborhood features, it is crucial for spatial feature learning. In addition, it has no learnable parameters, so modifying it does not change the pretrained weights. Based on this analysis, we propose enhanced spatial feature learning (ESFL) for WSOD, which first takes full advantage of multiple kernels in a single pooling layer to handle multiscale objects and then enhances above-average activations within the rectangular neighborhood to alleviate the problem of ignoring unsalient object parts. Experimental results on the PASCAL VOC and MS COCO benchmarks demonstrate that ESFL brings significant performance improvements to WSOD methods and achieves state-of-the-art results.
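The two ideas named in the abstract, pooling with several kernel sizes in one layer and emphasizing above-average activations inside each rectangular neighborhood, can be illustrated with a minimal pure-Python sketch. This is not the authors' formulation: the abstract does not give the actual ESFL operators, so the function names (`windows`, `pool`, `above_average_mean`, `multi_kernel_pool`), the stride of 1, the border clipping, the odd kernel sizes, and the choice to average the per-kernel outputs are all assumptions made for illustration.

```python
def windows(fmap, k):
    """Yield (row, col, values) for each k x k neighborhood (k odd), clipped at borders."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    for i in range(h):
        for j in range(w):
            vals = [fmap[y][x]
                    for y in range(max(0, i - r), min(h, i + r + 1))
                    for x in range(max(0, j - r), min(w, j + r + 1))]
            yield i, j, vals


def pool(fmap, k, reduce_fn):
    """Stride-1 pooling of a 2D feature map with an arbitrary window reduction."""
    out = [[0.0] * len(fmap[0]) for _ in fmap]
    for i, j, vals in windows(fmap, k):
        out[i][j] = reduce_fn(vals)
    return out


def above_average_mean(vals):
    """Assumed reduction: average only the activations at or above the window mean.

    The result lies between average pooling and max pooling, so strong but
    non-maximal responses still contribute to the pooled value.
    """
    mean = sum(vals) / len(vals)
    kept = [v for v in vals if v >= mean]  # never empty: the maximum is always >= mean
    return sum(kept) / len(kept)


def multi_kernel_pool(fmap, kernel_sizes=(1, 3, 5), reduce_fn=above_average_mean):
    """Assumed fusion: pool with each kernel size and average the outputs.

    No learnable parameters are involved, so pretrained weights are untouched.
    """
    pooled = [pool(fmap, k, reduce_fn) for k in kernel_sizes]
    n = len(pooled)
    return [[sum(p[i][j] for p in pooled) / n for j in range(len(fmap[0]))]
            for i in range(len(fmap))]
```

Because every step is parameter-free, a layer of this kind could in principle replace an existing pooling layer without invalidating pretrained weights, which is the property the abstract emphasizes.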
