Abstract
Hyperspectral images (HSIs) contain nearly continuous spectral information, so a target of interest can be accurately identified from subtle details of its spectral properties. Spectral resolution at different scales captures different levels of spectral features: small-scale spectral bands are beneficial for extracting global details in vision transformers, while large-scale spectral bands are more effective for local features. The transformer shows advantages in global information extraction through its self-attention module and even surpasses CNNs in various tasks, and several works based on the vision transformer have performed surprisingly well in HSI classification. However, a single-scale vision transformer cannot balance the extraction of local details against the redundancy present at different scales. A recent multi-scale vision transformer has addressed this for image classification with spatial patch-wise features. Inspired by this, we propose the cross-spectral vision transformer (CSiT), which uses two branches to extract pixel-wise multi-scale features, and further design a multi-scale spectral embedding module to enhance local details between neighboring spectral bands. Moreover, a single token from each branch serves as a query in a cross-attention operation to exchange information with the other branch. We evaluate the classification performance of the proposed CSiT on three classic HSI datasets through extensive experiments, showing that the multi-scale vision transformer architecture yields promising results for HSI classification with one-dimensional spectral bands.
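The cross-attention exchange between branches can be sketched roughly as follows: the classification token of one branch acts as a query over the patch tokens of the other branch. This is a minimal single-head illustration with hypothetical shapes, omitting the learned query/key/value projections and multi-head structure used in the actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(cls_token, other_tokens, d):
    """Scaled dot-product attention: one branch's class token (1, d)
    attends over the other branch's patch tokens (n, d)."""
    scores = cls_token @ other_tokens.T / np.sqrt(d)  # (1, n) attention logits
    weights = softmax(scores, axis=-1)                # (1, n) attention weights
    return weights @ other_tokens                     # (1, d) fused token

rng = np.random.default_rng(0)
d = 8  # hypothetical embedding dimension
cls_small = rng.standard_normal((1, d))     # class token of the small-scale branch
tokens_large = rng.standard_normal((5, d))  # patch tokens of the large-scale branch
fused = cross_attention(cls_small, tokens_large, d)
print(fused.shape)  # (1, 8)
```

The fused token then carries information from the other branch's scale back into its own branch, which is what allows the two spectral scales to complement each other.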