Abstract

The vision transformer (ViT) has attracted considerable attention in image processing owing to its global feature extraction capability. However, the ViT suffers from over-smoothing in feature extraction and over-fitting during training, making it difficult to achieve satisfactory performance in hyperspectral image (HSI) classification. To address these issues, we propose a ViT with contrastive learning (CViT). The network architecture comprises a patch embedding module, transformer blocks, and a classifier. The training of CViT can be formulated as an optimization problem with a supervised contrastive loss, an unsupervised contrastive loss, and an ℓ1-regularizer on the linear self-attention weights. Specifically, the supervised contrastive loss alleviates the negative effects of the spectral variability and spatial diversity of HSI features by increasing intra-class consistency. The unsupervised contrastive loss, in turn, reduces redundancy by reconstructing global structural information. In addition, the ℓ1-regularized linear self-attention weights mitigate the over-smoothing issue. Extensive experimental results on three HSI datasets demonstrate that the proposed CViT achieves competitive performance.
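To make the composite training objective concrete, the sketch below shows one way the three terms might be combined, assuming a PyTorch implementation. The function names, the weighting coefficients lambda_unsup and lambda_l1, and the temperature values are illustrative assumptions; the unsupervised term is written as a standard cross-view NT-Xent loss, which stands in for the paper's reconstruction-based formulation.

```python
# Hypothetical sketch of the CViT objective: supervised contrastive loss
# + unsupervised contrastive loss + l1 penalty on self-attention weights.
# Coefficients and temperatures are illustrative, not the paper's values.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, tau=0.1):
    """Pull same-class embeddings together (intra-class consistency)."""
    z = F.normalize(z, dim=1)                        # (N, d) unit vectors
    sim = z @ z.t() / tau                            # pairwise similarities
    mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    mask.fill_diagonal_(0)                           # exclude self-pairs
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # drop self term
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_count = mask.sum(1).clamp(min=1)             # guard single-sample classes
    return -(mask * log_prob).sum(1).div(pos_count).mean()

def unsupervised_contrastive_loss(z1, z2, tau=0.5):
    """One-directional NT-Xent between two views of the same patches."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                       # (N, N) cross-view sims
    targets = torch.arange(len(z1), device=z1.device)  # matching view is positive
    return F.cross_entropy(logits, targets)

def l1_attention_penalty(attn_weights):
    """l1 regularizer over the linear self-attention weight matrices."""
    return sum(w.abs().sum() for w in attn_weights)

def total_loss(z1, z2, labels, attn_weights,
               lambda_unsup=1.0, lambda_l1=1e-4):
    return (supervised_contrastive_loss(z1, labels)
            + lambda_unsup * unsupervised_contrastive_loss(z1, z2)
            + lambda_l1 * l1_attention_penalty(attn_weights))
```

In this reading, sparsifying the attention weights limits how strongly each token mixes information from all others, which is one plausible mechanism for the anti-over-smoothing effect the abstract describes.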
