Hyperspectral image (HSI) classification is currently a hot topic in the field of remote sensing. The goal is to exploit the spectral and spatial information in HSI to accurately identify land-cover classes. Convolutional neural networks (CNNs) are a powerful approach for HSI classification; however, CNNs have a limited ability to capture the non-local information needed to represent complex features. Recently, vision transformers (ViTs) have gained attention due to their ability to model non-local dependencies. Yet, in HSI classification scenarios with ultra-small sample rates, the spectral-spatial information available to ViTs for global modeling is insufficient, resulting in limited classification capability. Therefore, in this article, Multi-Attention Joint Convolution Feature Representation with Lightweight Transformer (MAR-LWFormer) is proposed, which effectively combines the spectral and spatial features of HSI to achieve efficient classification at ultra-small sample rates. Specifically, we first use a three-branch network architecture to extract multi-scale 3D-CNN, extended multi-attribute profile (EMAP), and local binary pattern (LBP) features of the HSI, respectively, making full use of the ultra-small set of training samples. Second, we design a series of multi-attention modules to enhance the spectral-spatial representation of the three feature types and to improve the coupling and fusion of multiple features. Third, we propose an explicit feature attention tokenizer to transform the feature information, maximizing the effective spectral-spatial information retained in the flattened tokens. Finally, the generated tokens are input to the designed lightweight transformer for encoding and classification. Experimental results on three datasets validate that MAR-LWFormer achieves excellent HSI classification performance at ultra-small sample rates compared with several state-of-the-art classifiers.
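The pipeline described above — three feature branches fused and then converted into tokens by an attention-based tokenizer — can be sketched as follows. This is a minimal illustrative NumPy sketch under assumed shapes; the branch features are random stand-ins, and the function name `attention_tokenize` and all dimensions are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, C = 9, 9, 16   # assumed spatial patch size and per-branch channel count

# Stand-ins for the three branches (multi-scale 3D-CNN, EMAP, LBP features);
# random maps here purely for illustration.
f_cnn  = rng.standard_normal((H, W, C))
f_emap = rng.standard_normal((H, W, C))
f_lbp  = rng.standard_normal((H, W, C))

# Fuse the branches by concatenation along the channel axis.
fused = np.concatenate([f_cnn, f_emap, f_lbp], axis=-1)   # (H, W, 3C)

def attention_tokenize(feat, n_tokens):
    """Collapse an (H, W, D) feature map into n_tokens tokens of length D
    via softmax attention maps over the spatial positions (hypothetical
    stand-in for the paper's explicit feature attention tokenizer)."""
    h, w, d = feat.shape
    flat = feat.reshape(h * w, d)                  # (HW, D)
    proj = rng.standard_normal((d, n_tokens))      # learnable in practice
    scores = flat @ proj                           # (HW, n_tokens)
    attn = np.exp(scores - scores.max(axis=0))
    attn /= attn.sum(axis=0)                       # softmax over positions
    return attn.T @ flat                           # (n_tokens, D)

tokens = attention_tokenize(fused, n_tokens=4)
print(tokens.shape)   # (4, 48): 4 flattened tokens for the transformer encoder
```

In this sketch the tokenizer trades the full H x W spatial grid for a handful of attention-pooled tokens, which is what keeps the subsequent transformer lightweight.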