Abstract

The superior local context modelling capability of convolutional neural networks (CNNs) allows CNN-based methods to achieve greatly enhanced performance in hyperspectral image (HSI) classification tasks. However, most of these methods suffer from a restricted receptive field and perform poorly on continuous data domains. To address these issues, we propose a multi-granularity vision transformer via semantic tokens (MSTViT) for HSI classification, which differs from the existing transformer view by modelling HSI classification as a word-embedding problem. Specifically, MSTViT extracts multi-level semantic features with a ladder feature extractor and applies a multi-granularity patch embedding module to embed these features simultaneously as tokens of different scales. The tokens of each granularity are then fed to a vision transformer to capture the long-distance dependencies among them. A depth-wise separable convolution multi-layer perceptron assists the attention mechanism in further excavating the deep information of the HSI. Finally, HSI classification performance is improved by fusing the coarse- and fine-granularity representations to generate stronger features. Experimental results on four standard datasets verify the marked improvement of MSTViT over state-of-the-art CNN and transformer architectures. The code of this work is available at https://github.com/zhaolin6/MSTViT for the sake of reproducibility.
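The multi-granularity patch embedding described above can be illustrated with a minimal NumPy sketch: an HSI cube is split into non-overlapping spatial patches at two scales, and each flattened patch is projected to a token vector. This is purely illustrative (the function name, shapes, and the random projection standing in for the learned embedding are assumptions, not the authors' implementation; see the linked repository for the actual code).

```python
import numpy as np

def patch_embed(cube, patch, dim, rng):
    """Split an (H, W, C) HSI cube into non-overlapping patch x patch blocks
    and project each flattened block to a dim-dimensional token.
    Illustrative sketch: a random projection replaces the learned embedding."""
    H, W, C = cube.shape
    blocks = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            blocks.append(cube[i:i + patch, j:j + patch, :].reshape(-1))
    tokens = np.stack(blocks)  # (num_patches, patch * patch * C)
    proj = rng.standard_normal((tokens.shape[1], dim)) / np.sqrt(tokens.shape[1])
    return tokens @ proj       # (num_patches, dim)

rng = np.random.default_rng(0)
hsi = rng.standard_normal((8, 8, 30))                # toy 8x8 cube, 30 bands
fine = patch_embed(hsi, patch=2, dim=64, rng=rng)    # fine-granularity tokens
coarse = patch_embed(hsi, patch=4, dim=64, rng=rng)  # coarse-granularity tokens
print(fine.shape, coarse.shape)                      # (16, 64) (4, 64)
```

Both token sequences share the embedding dimension, so each granularity can be processed by a transformer and the resulting coarse- and fine-granularity representations fused, as the abstract describes.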
