State space models meet transformers for hyperspectral image classification
- # Hyperspectral Image Classification
- # Hyperspectral Remote Sensing Image Classification
- # Image Classification
- # Hyperspectral Image Classification Task
- # Hyperspectral Image
- # State Space Model
- # Self-attention Mechanisms
- # Convolutional Neural Networks
- # Classification Task
- # Accurate Image Classification
- Research Article
1
- 10.3390/photonics12020146
- Feb 11, 2025
- Photonics
Hyperspectral imaging and laser technology both rely on different wavelengths of light to analyze the characteristics of materials, revealing their composition, state, or structure through precise spectral data. In hyperspectral image (HSI) classification tasks, the limited number of labeled samples and the lack of feature extraction diversity often lead to suboptimal classification performance. Furthermore, traditional convolutional neural networks (CNNs) primarily focus on local features in hyperspectral data, neglecting long-range dependencies and global context. To address these challenges, this paper proposes a novel model that combines CNNs with an average pooling Vision Transformer (ViT) for hyperspectral image classification. The model utilizes three-dimensional dilated convolution and two-dimensional convolution to extract multi-scale spatial–spectral features, while the ViT is employed to capture global features and long-range dependencies in the hyperspectral data. Unlike the traditional ViT encoder, which uses linear projection, our model uses average pooling projection. This change enhances the extraction of local features and compensates for the ViT encoder’s limitations in local feature extraction. This hybrid approach effectively combines the local feature extraction strengths of CNNs with the long-range dependency handling capabilities of Transformers, significantly improving overall performance in hyperspectral image classification tasks. Additionally, the proposed method holds promise for the classification of fiber laser spectra, where high precision and spectral analysis are crucial for distinguishing between different fiber laser characteristics. Experimental results demonstrate that the CNN-Transformer model substantially improves classification accuracy on three benchmark hyperspectral datasets. The overall accuracies achieved on the three public datasets (IP, PU, and SV) were 99.35%, 99.31%, and 99.66%, respectively. These advancements offer potential benefits for a wide range of applications, including high-performance optical fiber sensing, laser medicine, and environmental monitoring, where accurate spectral classification is essential.
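The abstract above relies on dilated convolution to obtain multi-scale features. As a rough illustration only, the following sketch applies a one-dimensional dilated convolution along a toy spectral axis; the paper itself uses three-dimensional dilated convolution, and the function names, kernel, and data here are hypothetical. It shows that a kernel of size k with dilation d covers (k - 1) * d + 1 input samples without adding weights.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1D dilated convolution (cross-correlation, as in deep learning)."""
    span = effective_kernel_size(len(kernel), dilation)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Kernel taps are spaced `dilation` samples apart.
        outputs.append(
            sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        )
    return outputs


# Toy "spectrum": reflectance rising linearly across 8 bands (illustrative data).
spectrum = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
difference_kernel = [1.0, 0.0, -1.0]

dense = dilated_conv1d(spectrum, difference_kernel, dilation=1)
wide = dilated_conv1d(spectrum, difference_kernel, dilation=2)
```

With the same three weights, the dilation-2 pass compares bands four positions apart instead of two, which is the sense in which stacking different dilation rates yields multi-scale features.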
- Research Article
84
- 10.1109/tgrs.2022.3185640
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Convolutional Neural Networks (CNNs) have been extensively applied to hyperspectral (HS) image classification tasks and achieved promising performance. However, for CNN-based HS image classification methods, it is hard to depict the dependencies among HS image pixels at distant positions and bands. Moreover, the limited receptive field of the convolutional layers severely hinders the development of the CNN structure. To tackle these problems, in this paper, the novel Bottleneck Spatial-Spectral Transformer (BS2T) is proposed to depict the long-range global dependencies of HS image pixels, which can be regarded as a feature extraction module for HS image classification networks. More specifically, inspired by the Bottleneck Transformer in computer vision, the proposed BS2T incorporates a feature contraction module, a multi-head spatial-spectral self-attention (MHS2A) module and a feature expansion module. In this way, convolutional operations are replaced by the MHS2A to capture the long-range dependency of HS pixels regardless of their spatial position and distance. Meanwhile, in the MHS2A module, to highlight the spectral features of HS images, we introduce the spectral information and content spatial positional information to classical multi-head self-attention to make the attention more position-aware and spectral-aware. On this basis, a dual-branch HS image classification framework based on 3D CNN and BS2T is defined for jointly extracting the local-global features of HS images. Experimental results on three public HS image classification datasets show that the proposed classification framework achieves a significant improvement compared with state-of-the-art methods. The source code of the proposed framework can be downloaded from https://github.com/srxlnnu/BS2T.
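To make the idea of attention capturing dependencies "regardless of spatial position and distance" concrete, here is a minimal sketch of classical single-head scaled dot-product self-attention over a handful of pixel tokens. It is not the MHS2A module: the learned query/key/value projections are replaced by the identity, there is only one head, and the spectral and positional terms the abstract mentions are omitted. The pixel vectors are made-up values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity Q/K/V (assumption)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Every token is scored against every other token, whatever their distance.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights


# Four toy pixel spectra; the first and third are similar but not adjacent.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(pixels)
```

Pixel 0 places more weight on the similar pixel 2 than on its immediate neighbour pixel 1, because the weights depend on content rather than position. The weight matrix has one entry per token pair, which is the source of the quadratic cost that some of the other abstracts on this page raise.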
- Research Article
13
- 10.1080/01431161.2022.2142078
- Sep 2, 2022
- International Journal of Remote Sensing
The superior local context modelling capability of convolutional neural networks (CNNs) in representing features allows greatly enhanced performance in hyperspectral image (HSI) classification tasks by CNN-based methods. However, most of these methods suffer from a restricted receptive field and poor performance in the continuous data domain. To address these issues, we propose a multi-granularity vision transformer via semantic token (MSTViT) for HSI classification, which differs from the existing transformer view by modelling the HSI classification tasks as word embedding problems. Specifically, the MSTViT model extracts multi-level semantic features by a ladder feature extractor and applies a multi-granularity patch embedding module to embed these features simultaneously as different-scale tokens. Moreover, different-granularity tokens are fed to the vision transformer to capture the long-distance dependencies among the different tokens. A depth-wise separable convolution multi-layer perceptron is used to assist the attention mechanism for further excavation of the deep information of HSI. Finally, the performance of HSI classification is improved by fusing the coarse- and fine-granularity representations to generate stronger features. Experimental results on four standard datasets verify the marked improvement of the MSTViT over state-of-the-art CNN and transformer structures. The code of this work is available at https://github.com/zhaolin6/MSTViT for the sake of reproducibility.
- Research Article
4
- 10.1080/01431161.2024.2408495
- Oct 7, 2024
- International Journal of Remote Sensing
The excellent capabilities of Transformers and Graph Neural Networks (GNNs) in modelling long-range dependencies and handling irregular data have led to their widespread application in hyperspectral image (HSI) classification tasks. However, the Graph Transformer combining both advantages is rarely used in this field and has some limitations. Current Graph Transformers consider interactions between all nodes within the graph, adding complexity and introducing unnecessary information from noisy nodes. Moreover, the rich spectral information in HSIs is often ignored, and there is a lack of effective fusion of spatial information. In this paper, we propose a dual-stream graph-guided Transformer for HSI classification. In spatial dimension, superpixels are utilized to guide spatial graph generation, capturing global topological dependencies and local details effectively in HSIs. In terms of spectrum, we innovatively construct a spectral graph based on spectral channels and adopt a contribution score-based strategy to adaptively filter out irrelevant edges, achieving sparsity while preserving spectral context relationships. Experimental results demonstrate the significant competitive advantage of our method in HSI classification tasks on three public datasets. The code is available at https://github.com/youngboy03/GTDPNet.
- Research Article
2
- 10.3390/rs17122008
- Jun 11, 2025
- Remote Sensing
Deep learning has recently achieved remarkable progress in hyperspectral image (HSI) classification. Among these advancements, the Transformer-based models have gained considerable attention due to their ability to establish long-range dependencies. However, the quadratic computational complexity of the self-attention mechanism limits its application in hyperspectral image classification (HSIC). Recently, the Mamba architecture has shown outstanding performance in 1D sequence modeling tasks owing to its lightweight linear sequence operations and efficient parallel scanning capabilities. Nevertheless, its application in HSI classification still faces challenges. Most existing Mamba-based approaches adopt various selective scanning strategies for HSI serialization, ensuring the adjacency of scanning sequences to enhance spatial continuity. However, these methods lead to substantially increased computational overhead. To overcome these challenges, this study proposes the Hyperspectral Spatial Mamba (HyperSMamba) model for HSIC, aiming to reduce computational complexity while improving classification performance. The suggested framework consists of the following key components: (1) a Multi-Scale Spatial Mamba (MS-Mamba) encoder, which refines the state-space model (SSM) computation by incorporating a Multi-Scale State Fusion Module (MSFM) after the state transition equations of original SSMs. This module aggregates adjacent state representations to reinforce spatial dependencies among local features; (2) our proposed Adaptive Fusion Attention Module (AFAttention) to dynamically fuse bidirectional Mamba outputs for optimizing feature representation. Experiments were performed on three HSI datasets, and the findings demonstrate that HyperSMamba attains overall accuracy of 94.86%, 97.72%, and 97.38% on the Indian Pines, Pavia University, and Salinas datasets, while maintaining low computational complexity. 
These results confirm the model’s effectiveness and potential for practical application in HSIC tasks.
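The state transition equations that this abstract builds on can be sketched as a plain linear recurrence. The example below is a simplified illustration under several assumptions: it uses a fixed diagonal state-space model, h_t = a * h_(t-1) + b * x_t and y_t = c * h_t, whereas Mamba makes these parameters input-dependent; it does not implement MSFM; and the bidirectional outputs are fused by a simple average as a stand-in for the paper's AFAttention module. All parameter values are hypothetical.

```python
def ssm_scan(inputs, a, b, c):
    """Discrete linear SSM with diagonal state, evaluated in one pass over the sequence."""
    state = [0.0] * len(a)
    outputs = []
    for x in inputs:
        # State transition: each state channel decays by a_i and absorbs b_i * x.
        state = [a_i * h_i + b_i * x for a_i, b_i, h_i in zip(a, b, state)]
        # Readout: project the state to a scalar output.
        outputs.append(sum(c_i * h_i for c_i, h_i in zip(c, state)))
    return outputs


def bidirectional_ssm(inputs, a, b, c):
    """Run the scan in both directions and average the results (placeholder fusion)."""
    forward = ssm_scan(inputs, a, b, c)
    backward = ssm_scan(inputs[::-1], a, b, c)[::-1]
    return [0.5 * (f + g) for f, g in zip(forward, backward)]


# Impulse at the first position of a length-4 token sequence (illustrative input).
sequence = [1.0, 0.0, 0.0, 0.0]
a, b, c = [0.5], [1.0], [1.0]

forward_only = ssm_scan(sequence, a, b, c)
fused = bidirectional_ssm(sequence, a, b, c)
```

Each scan visits every token once, so the cost grows linearly with sequence length, in contrast to the pairwise comparisons of self-attention. The forward scan alone only propagates information from earlier to later tokens, which is why bidirectional scanning is used for image data.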
- Research Article
59
- 10.3390/rs14164066
- Aug 19, 2022
- Remote Sensing
In recent years, deep-learning-based hyperspectral image (HSI) classification networks have become one of the most dominant implementations in HSI classification tasks. Among these networks, convolutional neural networks (CNNs) and attention-based networks have prevailed over other HSI classification networks. While convolutional neural networks with perceptual fields can effectively extract local features in the spatial dimension of HSI, they are poor at capturing the global and sequential features of spectral–spatial information; networks based on attention mechanisms, for example, Transformer, usually have better ability to capture global features, but are relatively weak in discriminating local features. This paper proposes a fusion network of convolution and Transformer for HSI classification, known as FusionNet, in which convolution and Transformer are fused in both serial and parallel mechanisms to achieve the full utilization of HSI features. Experimental results demonstrate that the proposed network has superior classification results compared to previous similar networks, and performs relatively well even on a small amount of training data.
- Research Article
43
- 10.1016/j.knosys.2020.106319
- Jul 29, 2020
- Knowledge-Based Systems
Hyperspectral image classification based on discriminative locality preserving broad learning system
- Research Article
2
- 10.1371/journal.pone.0322345
- May 23, 2025
- PloS one
Hyperspectral Image (HSI) classification tasks are commonly addressed with Convolutional Neural Networks (CNNs). However, the majority of models using traditional convolutions for HSI classification extract redundant information in the convolution layer, which causes the subsequent network structure to carry a large number of parameters and complex computations. This limits their classification effectiveness, particularly in situations with constraints on computational power and storage capacity. To address these issues, this paper proposes a lightweight multi-layer feature fusion classification method for hyperspectral images based on spatial and channel reconstruction (SCNet). Firstly, this method reduces redundant computations of spatial and spectral features by introducing Spatial and Channel Reconstruction Convolutions (SCConv), a novel convolutional compression method. Secondly, the proposed network backbone is stacked with multiple SCConv modules, which allows the network to capture spatial and spectral features that are more beneficial for hyperspectral image classification. Finally, to effectively utilize the multi-layer feature information generated by SCConv modules, a multi-layer feature fusion (MLFF) unit is designed to connect multiple feature maps at different depths, thereby obtaining a more robust feature representation. The experimental results demonstrate that, compared to seven other hyperspectral image classification methods, this network has significant advantages in terms of the number of parameters, model complexity, and testing time. These findings have been validated through experiments on four benchmark datasets.
- Research Article
38
- 10.3390/rs12122035
- Jun 24, 2020
- Remote Sensing
Recently, deep learning methods based on three-dimensional (3-D) convolution have been widely used in hyperspectral image (HSI) classification tasks and have shown good classification performance. However, affected by the irregular distribution of various classes in HSI datasets, most previous 3-D convolutional neural network (CNN)-based models require more training samples to obtain better classification accuracies. In addition, as the network deepens, the spatial resolution of feature maps gradually decreases, so much useful information may be lost during the training process. Therefore, how to ensure efficient network training is key to HSI classification tasks. To address the issues mentioned above, in this paper, we propose a 3-D-CNN-based residual group channel and space attention network (RGCSA) for HSI classification. Firstly, the proposed bottom-up top-down attention structure with the residual connection can improve network training efficiency by optimizing channel-wise and spatial-wise features throughout the whole training process. Secondly, the proposed residual group channel-wise attention module can reduce the possibility of losing useful information, and the novel spatial-wise attention module can extract context information to strengthen the spatial features. Furthermore, our proposed RGCSA network needs only a few training samples to achieve higher classification accuracies than previous 3-D-CNN-based networks. The experimental results on three commonly used HSI datasets demonstrate the superiority of our proposed network based on the attention mechanism and the effectiveness of the proposed channel-wise and spatial-wise attention modules for HSI classification. The code and configurations are released at Github.com.
- Research Article
37
- 10.1109/lgrs.2020.2991405
- Jun 1, 2020
- IEEE Geoscience and Remote Sensing Letters
Deep learning methods have shown remarkable performance on hyperspectral image (HSI) classification tasks. In particular, algorithms based on convolutional neural networks (CNNs) have outperformed most of the conventional machine learning-based algorithms and have become the mainstream of current HSI classification research. Recently, a newly proposed neural network called the capsule network (CapsNet) showed its potential to replace CNNs in various classification tasks with its strong performance. In this letter, we propose a new network architecture based on the CapsNet for HSI classification tasks, called the dual-channel capsule network (DCCapsNet). Our DCCapsNet model extracts the features from the spectral and spatial domains, respectively, with two separate convolution channels and then concatenates and feeds them into the following capsule layers to classify each of the HSI pixels. The model was trained and validated on four real HSI data sets and achieved high accuracy. We also compared our network with some of the state-of-the-art models and found that our model outperformed these competitor models.
- Conference Article
9
- 10.1109/igarss46834.2022.9884329
- Jul 17, 2022
In recent years, convolutional neural networks (CNNs) have been successfully applied in hyperspectral image (HSI) classification tasks. However, the spatial-spectral features within an HSI have not been well explored using convolutions in CNNs. In this paper, a novel end-to-end hierarchical spatial-spectral transformer (HSST) is proposed for HSI classification, in which effective spatial-spectral features are emphasized using a multi-head self-attention (MHSA) mechanism. The MHSA module captures the internal correlation of HSI data better than the traditional convolution operation and can compute weighting scores for the spatial and spectral context of pixels. Furthermore, a hierarchical architecture is designed to substantially reduce the number of parameters relative to the original transformer-style networks while still achieving satisfactory classification results. Experimental results on two benchmark HSI datasets demonstrate that the proposed HSST clearly outperforms several state-of-the-art deep learning-based HSI classification algorithms.
- Conference Article
1
- 10.1109/icot.2018.8705785
- Oct 1, 2018
Hyperspectral images (HSIs) contain rich spectral and spatial information and are widely used in remote sensing image analysis and in many areas of daily life. Owing to their powerful feature representations, deep learning based methods are receiving increasing attention and achieving acceptable classification results. As a representative of the deep learning methods, convolutional neural networks (CNNs) have shown great ability in HSI classification tasks. However, the hyper-parameters of CNN-based HSI classification methods (e.g., the number of convolutional layers) are often set from experience, and how to determine the number of convolutional layers (the model of convolutional layer connections) from data is seldom studied in existing CNN-based HSI classification methods. To deal with this problem, this paper proposes an effective approach to learn the structure of a CNN (e.g., a data-determined number of layers) in HSI classification tasks, where the CNN structure is learned via a genetic algorithm (GA). With the learned adaptive CNN structure, better HSI classification results can be acquired. Experimental results on two datasets demonstrate the effectiveness of the proposed method.
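As a rough picture of how a genetic algorithm could pick the number of convolutional layers from data, the sketch below evolves a population of candidate depths. Everything in it is a hypothetical stand-in rather than the paper's method: `toy_fitness` replaces the expensive step of training a CNN and measuring validation accuracy, each individual is a single integer gene, and the selection, crossover, and mutation settings are arbitrary illustrative choices.

```python
import random


def toy_fitness(num_layers):
    """Hypothetical stand-in for validation accuracy; peaks at 5 layers."""
    return 1.0 - 0.02 * (num_layers - 5) ** 2


def evolve_depth(fitness, low=1, high=10, population_size=8, generations=15, seed=0):
    """Search for a good layer count with a minimal genetic algorithm."""
    rng = random.Random(seed)
    population = [rng.randint(low, high) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: population_size // 2]  # truncation selection
        children = [ranked[0]]                    # elitism: keep the best unchanged
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            child = rng.choice([mother, father])  # crossover on a single integer gene
            if rng.random() < 0.3:                # mutation: change depth by one layer
                child = min(high, max(low, child + rng.choice([-1, 1])))
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history


best_depth, history = evolve_depth(toy_fitness)
```

Because the best individual is always carried over, the best fitness per generation never decreases. In a real setting the fitness call dominates the cost, since each evaluation means training and validating a network.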
- Research Article
32
- 10.32604/cmes.2022.020601
- Jan 1, 2022
- Computer Modeling in Engineering & Sciences
Hyperspectral image (HSI) classification has been one of the most important tasks in the remote sensing community over the last few decades. Due to the presence of highly correlated bands and limited training samples in HSI, discriminative feature extraction was challenging for traditional machine learning methods. Recently, deep learning based methods have been recognized as a powerful feature extraction tool and have drawn a significant amount of attention in HSI classification. Among various deep learning models, convolutional neural networks (CNNs) have shown huge success and offered great potential to yield high performance in HSI classification. Motivated by this successful performance, this paper presents a systematic review of different CNN architectures for HSI classification and provides some future guidelines. To accomplish this, our study has taken a few important steps. First, we have focused on different CNN architectures, which are able to extract spectral, spatial, and joint spectral-spatial features. Then, many publications related to CNN based HSI classification have been reviewed systematically. Further, a detailed comparative performance analysis has been presented between four CNN models, namely 1D CNN, 2D CNN, 3D CNN, and feature fusion based CNN (FFCNN). Four benchmark HSI datasets have been used in our experiment for evaluating the performance. Finally, we conclude the paper with challenges in CNN based HSI classification and future guidelines that may help researchers working on HSI classification using CNNs.
- Research Article
36
- 10.1109/tgrs.2022.3180685
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Hyperspectral image (HSI) classification has been a hot topic for decades, as hyperspectral images have rich spatial and spectral information and provide a strong basis for distinguishing different land-cover objects. Benefiting from the development of deep learning technologies, deep learning based HSI classification methods have achieved promising performance. Recently, several neural architecture search (NAS) algorithms have been proposed for HSI classification, which further improve the accuracy of HSI classification to a new level. In this paper, NAS and Transformer are combined for handling the HSI classification task for the first time. Compared with previous work, the proposed method has two main differences. First, we revisit the search spaces designed in previous HSI classification NAS methods and propose a novel hybrid search space, consisting of the space dominated cell and the spectrum dominated cell. Compared with search spaces proposed in previous works, the proposed hybrid search space is more aligned with the characteristics of HSI data, that is, HSIs have a relatively low spatial resolution and an extremely high spectral resolution. Second, to further improve the classification accuracy, we attempt to graft the emerging transformer module onto the automatically designed convolutional neural network (CNN) to add global information to the local region focused features learned by the CNN. Experimental results on three public HSI datasets show that the proposed method achieves much better performance than comparison approaches, including manually designed networks and NAS based HSI classification methods. In particular, on the most recently captured dataset, Houston University, overall accuracy is improved by nearly 6 percentage points. Code is available at: https://github.com/Cecilia-xue/HyT-NAS.
- Research Article
7
- 10.1364/josaa.478585
- Feb 21, 2023
- Journal of the Optical Society of America A
In recent years, generative adversarial networks (GANs), consisting of two competing 2D convolutional neural networks (CNNs) that are used as a generator and a discriminator, have shown promising capabilities in hyperspectral image (HSI) classification tasks. Essentially, the performance of HSI classification lies in the feature extraction ability for both spectral and spatial information. The 3D CNN has excellent advantages in simultaneously mining the above two types of features but has rarely been used due to its high computational complexity. This paper proposes a hybrid spatial-spectral generative adversarial network (HSSGAN) for effective HSI classification. The hybrid CNN structure is developed for the construction of the generator and the discriminator. For the discriminator, the 3D CNN is utilized to extract the multi-band spatial-spectral feature, and then we use the 2D CNN to further represent the spatial information. To reduce the accuracy loss caused by information redundancy, a channel and spatial attention mechanism (CSAM) is specially designed. To be specific, a channel attention mechanism is exploited to enhance the discriminative spectral features. Furthermore, the spatial self-attention mechanism is developed to learn the long-term spatial similarity, which can effectively suppress invalid spatial features. Both quantitative and qualitative experiments implemented on four widely used hyperspectral datasets show that the proposed HSSGAN has a satisfactory classification effect compared to conventional methods, especially with few training samples.