Masked Momentum Contrastive Learning for Semantic Understanding by Observation

Abstract

Large language models (LLMs) have shown excellent zero-shot performance when guided by natural language prompts. In computer vision (CV), however, the pretrain-then-finetune paradigm remains dominant. This study aims to narrow that gap by exploiting the semantic-understanding capability of self-supervised learning (SSL) for zero-shot segmentation, without relying on human-provided labels or vision-language supervision. We introduce a novel evaluation framework based on visual prompts, consisting of a query patch and a similarity threshold, which measures how well SSL models derive concepts purely from observational data. Through this evaluation, we identify the strengths and limitations of SSL models in understanding semantics. Building on insights from a range of SSL methods, we further propose MMC, which integrates Masked image modeling, Momentum-based self-distillation, and global Contrastive learning to enhance object representations. MMC strikes a better balance between the inter-object discriminability and the intra-object compactness of learned features. Our experiments on COCO, DAVIS-2017, PASCAL VOC, and ADE20K demonstrate the outstanding performance of MMC's representations.
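The visual-prompt evaluation described above can be sketched minimally: given per-patch features from a frozen SSL backbone, compare every patch to a user-chosen query patch by cosine similarity and threshold the result to obtain a zero-shot segmentation mask. The function name, toy features, and the exact decision rule below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def prompt_segment(patch_feats, query_idx, threshold):
    """Zero-shot segmentation with a visual prompt (illustrative sketch).

    patch_feats: (N, D) array of per-patch features from a frozen SSL model.
    query_idx:   index of the user-chosen query patch (the visual prompt).
    threshold:   cosine-similarity cutoff for membership in the object.
    """
    # L2-normalize so the dot product equals cosine similarity.
    feats = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sims = feats @ feats[query_idx]   # similarity of each patch to the query
    return sims >= threshold          # boolean per-patch segmentation mask

# Toy example: two well-separated clusters standing in for object/background.
rng = np.random.default_rng(0)
obj = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(4, 2))
bg = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(4, 2))
patches = np.vstack([obj, bg])
mask = prompt_segment(patches, query_idx=0, threshold=0.9)
```

With features this cleanly separated, the mask selects the four "object" patches and rejects the four "background" patches; on real SSL features, the quality of the mask directly reflects how compact and discriminative the learned representations are, which is what the framework evaluates.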
