Learning effective visual representations without human supervision is a critical problem for semantic segmentation of remote sensing images (RSIs), where pixel-level annotations are difficult to obtain. Self-supervised learning (SSL), which learns useful representations by constructing artificial supervised tasks, has recently emerged as an effective way to learn from unlabelled data. Current SSL methods are generally trained on ImageNet through image-level prediction tasks. We argue that this is suboptimal for semantic segmentation of RSIs because it ignores the spatial relationships between objects, which are critical for segmenting RSIs that typically contain multiple objects. In this study, we propose IndexNet, a novel self-supervised dense representation learning method for the semantic segmentation of RSIs. On the one hand, to account for the multi-object nature of RSIs, IndexNet learns pixel-level representations by tracking object positions, remaining sensitive to positional changes so that corresponding pixels are matched correctly. On the other hand, by combining image-level and pixel-level contrast, IndexNet learns spatiotemporally invariant features. Experimental results show that our method outperforms ImageNet pre-training as well as state-of-the-art (SOTA) self-supervised learning methods. Code and pre-trained models will be available at: https://github.com/pUmpKin-Co/offical-IndexNet.
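To make the combined objective concrete, the sketch below shows one common way an image-level InfoNCE term and a pixel-level InfoNCE term can be summed into a single loss. This is not the authors' IndexNet implementation: the helper names (`info_nce`, `combined_contrastive_loss`), the `lambda_pix` weighting, the temperature value, and the assumption that the two dense feature maps are already spatially aligned are all illustrative assumptions.

```python
# Minimal, illustrative sketch of combining image-level and pixel-level
# contrastive (InfoNCE) losses. NOT the authors' IndexNet code; names,
# weighting, and the cosine-similarity formulation are assumptions.
import torch
import torch.nn.functional as F


def info_nce(q: torch.Tensor, k: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over two aligned batches: row i of `q` is the positive of
    row i of `k`; all other rows act as negatives."""
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature                      # (N, N) similarities
    targets = torch.arange(q.size(0), device=q.device)    # positives on diagonal
    return F.cross_entropy(logits, targets)


def combined_contrastive_loss(
    img_q: torch.Tensor,      # (B, C) image-level embeddings, view 1
    img_k: torch.Tensor,      # (B, C) image-level embeddings, view 2
    pix_q: torch.Tensor,      # (B, C, H, W) dense feature map, view 1
    pix_k: torch.Tensor,      # (B, C, H, W) dense map, spatially aligned to view 1
    lambda_pix: float = 1.0,  # assumed weight balancing the two terms
) -> torch.Tensor:
    # Image-level term: one positive pair per image.
    loss_img = info_nce(img_q, img_k)
    # Pixel-level term: flatten spatial dims so each pixel embedding in
    # view 1 is contrasted against its spatial counterpart in view 2.
    b, c, h, w = pix_q.shape
    flat_q = pix_q.permute(0, 2, 3, 1).reshape(b * h * w, c)
    flat_k = pix_k.permute(0, 2, 3, 1).reshape(b * h * w, c)
    loss_pix = info_nce(flat_q, flat_k)
    return loss_img + lambda_pix * loss_pix
```

In a sketch like this, the pixel-level term depends on the two dense maps being brought into correspondence first (e.g., by tracking how augmentations move object positions), which is the role the abstract attributes to IndexNet's position tracking.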