Weakly-supervised object detection has recently attracted increasing attention because it requires only image-level annotations. However, the performance of existing methods remains far from satisfactory compared with fully-supervised object detection. To achieve a good trade-off between annotation cost and detection performance, we propose a simple yet effective method that combines CNN visualization with click supervision to generate pseudo ground-truths (i.e., bounding boxes), which can then be used to train a fully-supervised detector. To estimate the object scale, we first adopt a proposal selection algorithm to preserve high-quality proposals, and then generate Class Activation Maps (CAMs) for the preserved proposals with the proposed CNN visualization algorithm, Spatial Attention CAM. Finally, we fuse these CAMs to generate pseudo ground-truths and train a fully-supervised object detector with them. Experimental results on the PASCAL VOC 2007 and VOC 2012 datasets show that the proposed method estimates the object scale with much higher accuracy than state-of-the-art image-level based methods and the center-click based method.
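The abstract only summarizes how the per-proposal CAMs are fused into a box, so the following is a minimal sketch of one plausible reading, not the method itself. It assumes the CAMs are already computed (Spatial Attention CAM is not reproduced here), that fusion is a per-pixel average over the proposals covering each pixel, and that the pseudo ground-truth is the tight box around the above-threshold region connected to the click. The function names `fuse_cams` and `box_around_click`, the averaging rule, and the 0.5 relative threshold are all illustrative assumptions.

```python
from collections import deque

import numpy as np


def fuse_cams(image_shape, boxes, cams):
    """Average per-proposal CAMs into one image-sized map (assumed fusion rule).

    boxes: list of (x0, y0, x1, y1) integer pixel boxes, x1/y1 exclusive.
    cams:  list of 2D arrays, one per box, at any resolution.
    """
    height, width = image_shape
    total = np.zeros((height, width), dtype=np.float64)
    count = np.zeros((height, width), dtype=np.float64)
    for (x0, y0, x1, y1), cam in zip(boxes, cams):
        # Nearest-neighbour resize of the CAM to the proposal size.
        rows = np.arange(y1 - y0) * cam.shape[0] // (y1 - y0)
        cols = np.arange(x1 - x0) * cam.shape[1] // (x1 - x0)
        total[y0:y1, x0:x1] += cam[np.ix_(rows, cols)]
        count[y0:y1, x0:x1] += 1
    # Pixels covered by no proposal keep a response of zero.
    return total / np.maximum(count, 1)


def box_around_click(fused, click, rel_threshold=0.5):
    """Tight box around the above-threshold region connected to the click.

    click: (x, y) pixel position of the annotator's click.
    Returns (x0, y0, x1, y1) with exclusive x1/y1, or None if the click
    does not fall on an above-threshold pixel.
    """
    peak = fused.max()
    if peak <= 0:
        return None
    mask = fused >= rel_threshold * peak
    cx, cy = click
    if not mask[cy, cx]:
        return None
    # Breadth-first flood fill over 4-connected neighbours.
    seen = np.zeros(mask.shape, dtype=bool)
    seen[cy, cx] = True
    queue = deque([(cy, cx)])
    x_min = x_max = cx
    y_min = y_max = cy
    while queue:
        y, x = queue.popleft()
        x_min, x_max = min(x_min, x), max(x_max, x)
        y_min, y_max = min(y_min, y), max(y_max, y)
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            inside = 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
            if inside and mask[ny, nx] and not seen[ny, nx]:
                seen[ny, nx] = True
                queue.append((ny, nx))
    return (x_min, y_min, x_max + 1, y_max + 1)
```

Restricting the box to the region connected to the click is one way the click could disambiguate between several activated objects of the same class; other fusion and thresholding rules would fit the description equally well.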