Learning with self-derived targets provides a non-contrastive method for unsupervised image representation learning, where the variety in targets is crucial. Recent work has achieved good performance by learning with targets obtained via cluster-balancing. However, the equal-cluster-size constraint becomes too restrictive for handling data with imbalanced categories or coming in small batches. In this paper, we propose a new clustering-based approach for non-contrastive image representation learning with no need for a particular architecture design or extra memory bank and no explicit constraints on cluster size. A key formulation is to learn embedding consistency and variable decorrelation in the cluster space by tweaking the batch-wise cross-correlation matrix towards an identity one. With this identitization loss incorporated, predicted cluster assignments of two randomly augmented views of the same image serve as targets for each other. We carried out comprehensive experimental studies of linear classification with learned representations of benchmark image datasets. Our results show that the proposed approach significantly outperforms state-of-the-art approaches and is more robust to class imbalance than those with cluster balancing.
Read full abstract