Abstract

Self-supervised pre-training has attracted increasing attention given its promising performance in training backbone networks without labels. So far, most methods focus on image classification with datasets containing iconic objects and simple backgrounds, e.g. ImageNet. However, these methods show sub-optimal performance on dense prediction tasks (e.g. object detection and scene parsing) when pre-trained directly on datasets with multiple objects and cluttered backgrounds (e.g. PASCAL VOC and COCO). Researchers have explored self-supervised dense pre-training by adapting recent image pre-training methods, but these approaches require a large number of negative samples and long training times to reach reasonable performance. In this paper, we propose LPCL, a novel self-supervised representation learning method for dense prediction that addresses these issues. To capture instance information in multi-instance datasets, we define an online object patch selection module that efficiently selects, during learning, the local patches in the augmented views that have a high probability of containing an instance region. Given these patches, we present a novel multi-level contrastive learning method that considers instance representations at the global, local and position levels without using negative samples. We conduct extensive experiments with LPCL pre-trained directly on PASCAL VOC and COCO. On the PASCAL VOC image classification task, our model pre-trained on COCO achieves a state-of-the-art 86.2% accuracy (+9.7% top-1 accuracy compared with the BYOL baseline). On object detection, instance segmentation and semantic segmentation tasks, our model also achieves competitive results compared with other state-of-the-art methods.
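As a rough illustration of what learning "without using negative samples" can look like, the sketch below uses the normalized regression loss from BYOL (the baseline named above), which equals 2 - 2 * cosine similarity for a positive pair, and sums it over three levels. The function names, the equal default weights and the rule for combining levels are assumptions made for illustration; the abstract does not specify LPCL's actual loss.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def negative_free_loss(online_preds, target_projs):
    """BYOL-style loss: 2 - 2 * cosine similarity, averaged over the batch.

    Only positive pairs (two views of the same content) are compared,
    so no negative samples are needed.
    """
    losses = [
        2.0 - 2.0 * cosine_similarity(p, z)
        for p, z in zip(online_preds, target_projs)
    ]
    return sum(losses) / len(losses)


def multi_level_loss(pairs_by_level, weights=None):
    """Weighted sum of per-level losses (hypothetical combination rule).

    pairs_by_level maps a level name, e.g. "global", "local" or "position",
    to an (online_preds, target_projs) pair of equally sized batches.
    """
    if weights is None:
        # Assumption: all levels contribute equally.
        weights = {name: 1.0 for name in pairs_by_level}
    return sum(
        weights[name] * negative_free_loss(preds, targets)
        for name, (preds, targets) in pairs_by_level.items()
    )
```

The per-pair loss ranges from 0 (aligned embeddings) to 4 (opposite embeddings). In BYOL the target branch is a slowly updated copy of the online network with gradients stopped; that part is omitted here.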
