Recently, there has been growing interest in RGB-D object tracking, owing to the promising performance achieved by combining visual information with auxiliary depth cues. However, the limited volume of annotated RGB-D tracking data available for offline training has hindered the development of dedicated end-to-end RGB-D trackers. Consequently, current state-of-the-art RGB-D trackers rely mainly on the visual branch for appearance modelling, with the depth map used only for elementary information fusion or for failure reasoning during online tracking. Despite this progress, existing RGB-D tracking paradigms neither fully harness the inherent potential of depth information nor fully exploit the synergy between the visual and depth modalities. Motivated by the availability of ample unlabelled RGB-D data and recent advances in self-supervised learning, we address the problem of self-supervised learning for RGB-D object tracking. Specifically, an RGB-D backbone network is pre-trained on unlabelled RGB-D datasets using masked image modelling: the masking mechanism selectively occludes the input visible image, forcing the network to exploit the corresponding aligned depth map and thereby to discern and learn joint vision-depth cues for reconstructing the masked visible image. As a result, the pre-trained backbone network is capable of capturing the crucial visual and depth features of diverse objects and backgrounds in RGB-D images. The intermediate RGB-D features produced by the pre-trained network can be used effectively for object tracking; we therefore embed the pre-trained RGB-D backbone in a transformer-based tracking framework for robust tracking. Comprehensive experiments and analysis on several RGB-D tracking datasets demonstrate the effectiveness and superiority of the proposed RGB-D self-supervised learning framework and the resulting tracking approach.
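
To make the depth-assisted masked image modelling concrete, the sketch below shows one plausible realisation in PyTorch, assuming an MAE-style patch pipeline in which RGB patches are randomly masked while the aligned depth patches remain fully visible, so reconstruction of the masked RGB content must lean on the depth cues. All class, function, and parameter names here (e.g. `RGBDMaskedAutoencoder`, `mask_ratio`) are hypothetical illustrations, not the paper's actual implementation.

```python
# Minimal sketch of depth-assisted masked image modelling (MAE-style).
# Hypothetical names; dimensions and masking strategy are assumptions.
import torch
import torch.nn as nn


def patchify(x: torch.Tensor, patch: int) -> torch.Tensor:
    """(B, C, H, W) -> (B, N, patch*patch*C) non-overlapping patches."""
    B, C, H, W = x.shape
    x = x.unfold(2, patch, patch).unfold(3, patch, patch)   # B,C,h,w,p,p
    return x.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, patch * patch * C)


class RGBDMaskedAutoencoder(nn.Module):
    def __init__(self, patch=16, dim=256, layers=4, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.rgb_embed = nn.Linear(patch * patch * 3, dim)  # RGB patch tokens
        self.d_embed = nn.Linear(patch * patch * 1, dim)    # depth patch tokens
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, patch * patch * 3)       # reconstruct RGB

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        rgb_p = patchify(rgb, self.patch)                   # (B, N, p*p*3)
        d_p = patchify(depth, self.patch)                   # (B, N, p*p*1)
        B, N, _ = rgb_p.shape
        # Randomly occlude a fraction of the RGB patches; depth stays
        # intact, so reconstruction must exploit the aligned depth cues.
        mask = torch.rand(B, N, device=rgb.device) < self.mask_ratio
        tokens = self.rgb_embed(rgb_p)
        tokens = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, -1), tokens)
        tokens = tokens + self.d_embed(d_p)                 # fuse depth tokens
        pred = self.head(self.encoder(tokens))              # (B, N, p*p*3)
        # Reconstruction loss on the masked RGB patches only, as in MAE.
        return ((pred - rgb_p) ** 2).mean(-1)[mask].mean()


# Usage: one self-supervised pre-training step on an unlabelled RGB-D pair.
model = RGBDMaskedAutoencoder()
rgb = torch.randn(2, 3, 224, 224)      # visible image batch
depth = torch.randn(2, 1, 224, 224)    # aligned depth maps
loss = model(rgb, depth)
loss.backward()
```

In such a design, computing the loss only over the masked RGB patches is what forces the encoder to borrow structure from the unmasked depth tokens; the intermediate encoder features would then be the RGB-D representations handed to the downstream transformer-based tracker.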