Abstract

Deep convolutional neural networks have dominated hyperspectral image (HSI) classification. However, a convolutional kernel has a limited receptive field and fails to capture the sequential properties of the data. Self-attention-based Transformers can model global sequence information; among them, the Swin Transformer (SwinT) combines sequence modeling capability with prior knowledge of visual signals (e.g., locality and translation invariance). Building on SwinT, we propose a 3D Swin Transformer (3DSwinT) that accommodates the 3D structure of HSI and captures its rich spatial-spectral information. Supervised learning is still the most commonly used approach to remote sensing image interpretation, but pixel-by-pixel HSI classification demands a large number of high-quality labeled samples, which are time-consuming and costly to collect. As a form of unsupervised learning, self-supervised learning (SSL), especially contrastive learning, can learn semantic representations from unlabeled data and is therefore becoming a potential alternative to supervised learning. However, current contrastive learning methods are all single-level or single-scale and do not account for the complex and variable multi-scale features of objects. We therefore propose a novel 3DSwinT-based hierarchical contrastive learning method (3DSwinT-HCL) that exploits multi-scale semantic representations of images. In addition, we propose a multi-scale local contrastive learning (MS-LCL) module that mines pixel-level representations to suit downstream dense prediction tasks. A series of experiments verifies the potential and superiority of 3DSwinT-HCL.
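
To make the contrastive objective concrete, the following is a minimal sketch of how a multi-scale contrastive loss could be computed. It is an illustration under stated assumptions, not the 3DSwinT-HCL implementation: the abstract does not specify the loss, so the widely used InfoNCE formulation is assumed, and the function names (`info_nce`, `multi_scale_loss`), the temperature value, and the weighted sum across scales are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed loss, not taken from the paper).

    view_a[i] and view_b[i] are embeddings of two augmentations of the same
    sample (the positive pair); every view_b[j] with j != i is a negative.
    """
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the positive pair as the target class.
        total += log_sum - logits[i]
    return total / n


def multi_scale_loss(views_a, views_b, weights=None, temperature=0.1):
    """Weighted sum of per-scale InfoNCE losses (hypothetical combination).

    views_a[s] and views_b[s] hold the batch embeddings taken at scale s,
    for example from successive stages of a hierarchical backbone.
    """
    if weights is None:
        weights = [1.0] * len(views_a)
    return sum(
        w * info_nce(a, b, temperature)
        for w, a, b in zip(weights, views_a, views_b)
    )
```

The loss is small when each embedding is most similar to its own positive and large when it is more similar to a negative, which is the property a contrastive pre-training objective relies on.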
