Asymmetric appearance between the two views of a positive pair effectively reduces the risk of representation degradation in contrastive learning. However, positive pairs constructed by existing methods still share substantial appearance similarities, which inhibits further improvement of the representation. To address this issue, we propose a novel asymmetric patch sampling strategy that significantly reduces appearance similarities while retaining image semantics. Specifically, two different patch sampling strategies are applied to the given image. First, sparse patch sampling is conducted to obtain the first view, which reduces the spatial redundancy of the image and allows a more asymmetric view. Second, selective patch sampling is proposed to construct the other view with a large appearance discrepancy relative to the first one. Because the appearance similarities between the positive pair are negligible, the trained model is encouraged to capture semantic similarities instead of low-level ones. Experimental results demonstrate that our method significantly outperforms existing self-supervised learning methods on ImageNet-1K and the CIFAR datasets, e.g., a 2.5% improvement in fine-tuning accuracy on CIFAR100. Furthermore, our method achieves state-of-the-art performance on the downstream tasks of object detection and instance segmentation on COCO. Additionally, compared to other self-supervised methods, our method is more efficient in both memory and computation during pretraining. The source code and the trained weights are available at https://github.com/visresearch/aps.
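
The two sampling steps described above can be illustrated with a minimal sketch. It assumes, for simplicity, that both views are drawn from the same patch grid and that selective sampling is approximated by down-weighting patches already chosen for the first view. The function names and the `keep_ratio` and `overlap_weight` parameters are hypothetical and are not taken from the released code, which may differ substantially.

```python
import numpy as np


def sparse_patch_sampling(num_patches, keep_ratio, rng):
    """Uniformly sample a sparse subset of patch indices for the first view."""
    num_keep = max(1, int(round(num_patches * keep_ratio)))
    return np.sort(rng.choice(num_patches, size=num_keep, replace=False))


def selective_patch_sampling(num_patches, first_view, keep_ratio, overlap_weight, rng):
    """Sample the second view, down-weighting patches used by the first view.

    overlap_weight is an assumed knob: 0.0 forbids shared patches entirely,
    1.0 falls back to uniform sampling that ignores the first view.
    """
    weights = np.ones(num_patches, dtype=np.float64)
    weights[first_view] = overlap_weight
    probs = weights / weights.sum()
    num_keep = max(1, int(round(num_patches * keep_ratio)))
    # With overlap_weight=0.0 this requires enough unused patches to draw from.
    return np.sort(rng.choice(num_patches, size=num_keep, replace=False, p=probs))


rng = np.random.default_rng(0)
num_patches = 14 * 14  # e.g. a 224x224 image split into 16x16 patches
view1 = sparse_patch_sampling(num_patches, keep_ratio=0.25, rng=rng)
view2 = selective_patch_sampling(
    num_patches, view1, keep_ratio=0.25, overlap_weight=0.0, rng=rng
)
```

With `overlap_weight=0.0` the two views share no patches, so any agreement between their embeddings has to come from image semantics rather than identical pixels. Intermediate weights would allow a controlled amount of overlap.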