Hyperspectral Image Classification Using 3D Attention Mechanism in Collaboration with Transformer
With the continuous innovation in deep learning, introducing deep learning into hyperspectral image classification to improve its accuracy has become a major research direction. Convolutional neural networks (CNNs) are among the most commonly used deep learning methods for visual data processing, and are widely used in hyperspectral image (HSI) classification by virtue of their excellent contextual modeling capability. HSI classification performance depends heavily on spatial and spectral information, yet current CNN-based classification models extract spatial-spectral features insufficiently and fail to mine and represent the sequence properties of spectral features well. To address these problems, this paper proposes an HSI classification method that uses a 3D attention mechanism in collaboration with a Transformer. We introduce a variant Transformer model built on a hybrid of improved 3D-CNN and 2D-CNN: the 3D and 2D convolutions combine complementary spatial-spectral and spectral information, a variant attention mechanism module strengthens spatial texture features, and the grouped Transformer is combined with skip connections so that the lower layers can better learn upper-layer features. Firstly, a variant channel attention mechanism is introduced on the 3D-CNN to enhance its acquisition of the spectral information of image features. Secondly, a variant spatial attention mechanism is introduced to enable the 3D-CNN to better acquire the spatial information of hyperspectral images, and the acquired spatial and spectral feature information is then passed to the 2D-CNN so that it can better acquire local feature information. 
Finally, the acquired image feature information is passed to the variant Transformer model to compensate for the fact that a CNN can only acquire hyperspectral image features in local contexts, enabling the model to better acquire global feature information on feature sequences. The proposed model was evaluated on two hyperspectral datasets, Indian Pines and Pavia University (PU). On the PU dataset, the overall classification accuracy (OA), average classification accuracy (AA), and Kappa coefficient reach up to 99.59%, 99.31%, and 99.45%, respectively, an improvement in classification accuracy over current cutting-edge techniques.
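The three reported metrics (OA, AA, and the Kappa coefficient) all follow from a confusion matrix. The sketch below uses the standard definitions and is not taken from the paper's code; the function name `classification_scores` and the integer label encoding (0 to `num_classes - 1`) are assumptions made for illustration.

```python
import numpy as np

def classification_scores(y_true, y_pred, num_classes):
    """Return (OA, AA, kappa) from integer class labels.

    Hypothetical helper using the standard metric definitions.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # Confusion matrix: rows are true classes, columns are predictions.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    total = cm.sum()
    # Overall accuracy: fraction of correctly labelled samples.
    oa = np.trace(cm) / total
    # Average accuracy: mean per-class recall over classes that occur.
    support = cm.sum(axis=1)
    present = support > 0
    aa = np.mean(np.diag(cm)[present] / support[present])
    # Cohen's kappa: agreement corrected for chance agreement pe.
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / total**2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

oa, aa, kappa = classification_scores([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Reported values such as 99.45% for Kappa correspond to this coefficient multiplied by 100.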
- Research Article
18
- 10.3390/rs14092265
- May 8, 2022
- Remote Sensing
In recent years, hyperspectral image (HSI) classification has become a hot research direction in remote sensing image processing. Benefiting from the development of deep learning, convolutional neural networks (CNNs) have shown extraordinary achievements in HSI classification. Numerous methods combining CNNs and attention mechanisms (AMs) have been proposed for HSI classification. However, to fully mine the features of HSI, some of the previous methods apply dense connections to enhance the feature transfer between convolution layers. Although dense connections allow these methods to fully extract features from a few training samples, they decrease model efficiency and increase the computational cost. Furthermore, to balance model performance against complexity, the AMs in these methods compress a large number of channels or spatial resolutions during the training process, which results in a large amount of useful information being discarded. To tackle these issues, in this article, a novel one-shot dense network with polarized attention, namely, OSDN, is proposed for HSI classification. More precisely, since HSI contains rich spectral and spatial information, the OSDN has two independent branches to extract spectral and spatial features, respectively. Similarly, the polarized AMs contain two components: channel-only AMs and spatial-only AMs. Both polarized AMs can use a specially designed filtering method to reduce the complexity of the model while maintaining high internal resolution in both the channel and spatial dimensions. To verify the effectiveness and lightness of OSDN, extensive experiments were carried out on five benchmark HSI datasets, namely, Pavia University (PU), Kennedy Space Center (KSC), Botswana (BS), Houston 2013 (HS), and Salinas Valley (SV). Experimental results consistently showed that the OSDN can greatly reduce computational cost and parameters while maintaining high accuracy with few training samples.
- Research Article
36
- 10.1109/tgrs.2022.3180685
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Hyperspectral image (HSI) classification has been a hot topic for decades, as hyperspectral images have rich spatial and spectral information and provide a strong basis for distinguishing different land-cover objects. Benefiting from the development of deep learning technologies, deep learning based HSI classification methods have achieved promising performance. Recently, several neural architecture search (NAS) algorithms have been proposed for HSI classification, which further improve the accuracy of HSI classification to a new level. In this paper, NAS and Transformer are combined for handling the HSI classification task for the first time. Compared with previous work, the proposed method has two main differences. First, we revisit the search spaces designed in previous HSI classification NAS methods and propose a novel hybrid search space, consisting of the space dominated cell and the spectrum dominated cell. Compared with search spaces proposed in previous works, the proposed hybrid search space is more aligned with the characteristics of HSI data, that is, HSIs have a relatively low spatial resolution and an extremely high spectral resolution. Second, to further improve the classification accuracy, we attempt to graft the emerging transformer module onto the automatically designed convolutional neural network (CNN) to add global information to the local region focused features learned by the CNN. Experimental results on three public HSI datasets show that the proposed method achieves much better performance than comparison approaches, including manually designed networks and NAS based HSI classification methods. Especially on the most recently captured dataset, Houston University, overall accuracy is improved by nearly 6 percentage points. Code is available at: https://github.com/Cecilia-xue/HyT-NAS.
- Conference Article
7
- 10.1109/fskd.2017.8393336
- Jul 1, 2017
Hyperspectral image (HSI) classification is one of the most persistent issues in the remote sensing field. Recently, deep learning has attracted attention in the HSI classification field due to its accuracy and stronger generalization. This paper proposes a new spectral-spatial HSI classification approach built on stacked auto-encoder (SAE) based deep feature extraction and hidden Markov random field (HMRF) based segmentation. Specifically, first, the SAE model is implemented as a spectral information-based classifier to extract the deep spectral features. Second, spatial information is obtained by using an effective HMRF based segmentation technique. Finally, a maximum voting based criterion is employed to merge the extracted spectral and spatial information, which results in precise spectral-spatial HSI classification. Characterizing the HSI with spectral-spatial features results in a more comprehensive analysis of the HSI and a more accurate classification. In general, using the spectral information produced by the SAE, the spatial information obtained by HMRF based segmentation, and the merging of the two by the maximum voting based criterion has a significant effect on the accuracy of the HSI classification. Experiments on real, diverse hyperspectral data sets with different contexts and resolutions acquired by the AVIRIS and ROSIS sensors show the accuracy of the proposed method and confirm that the results of the proposed classification approach are comparable to several recently proposed HSI classification techniques.
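The voting-based merge of spectral and spatial information can be illustrated with a small sketch. It assumes, as one common reading of such a criterion, that every pixel in a segment receives the most frequent class label predicted by the spectral classifier within that segment; the abstract does not spell out the exact rule, and the function name `majority_vote_fusion` is hypothetical.

```python
import numpy as np

def majority_vote_fusion(pixel_labels, segments):
    """Relabel each segment with its most frequent per-pixel class.

    pixel_labels: (H, W) non-negative integer classes from a spectral classifier.
    segments:     (H, W) integer segment ids from a segmentation step.
    Assumed fusion rule, not the paper's confirmed procedure.
    """
    pixel_labels = np.asarray(pixel_labels)
    segments = np.asarray(segments)
    fused = np.empty_like(pixel_labels)
    for seg_id in np.unique(segments):
        mask = segments == seg_id
        # Count votes per class inside the segment; ties go to the lowest label.
        votes = np.bincount(pixel_labels[mask])
        fused[mask] = votes.argmax()
    return fused

labels = np.array([[0, 0, 1],
                   [0, 2, 2]])
segs = np.array([[0, 0, 0],
                 [1, 1, 1]])
fused = majority_vote_fusion(labels, segs)
```

In this toy input, the isolated label 1 in the first segment and the isolated label 0 in the second are overruled by their neighbours, which is the smoothing effect that spatial information is meant to provide.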
- Research Article
1
- 10.3390/rs16111888
- May 24, 2024
- Remote Sensing
In recent years, the use of deep neural networks for effective feature extraction and the design of efficient, high-precision hyperspectral image classification algorithms have gradually become a research hotspot. However, due to the difficulty of obtaining hyperspectral images and the high cost of annotation, the training samples are very limited. In order to cope with the small sample problem, researchers often deepen the network model and use the attention mechanism to extract features; however, as the network model continues to deepen, gradients vanish, the feature extraction ability becomes insufficient, and the computational cost is high. Therefore, how to make full use of the spectral and spatial information in limited samples has gradually become a difficult problem. In order to cope with such problems, this paper proposes FHDANet, a hyperspectral image classification model based on two-branch multiscale spatial–spectral feature aggregation with a self-attention mechanism. The model constructs a dense two-branch pyramid structure, which can achieve highly efficient extraction of joint spatial–spectral feature information and spectral feature information, reduce feature loss to a large extent, and strengthen the model’s ability to extract contextual information. A channel–space attention module, ECBAM, is proposed, which greatly improves the model’s ability to extract salient features, and a spatial information extraction module based on the deep feature fusion strategy, HLDFF, is proposed, which fully strengthens feature reusability and mitigates the feature loss problem brought about by the deepening of the model. Compared with the hyperspectral image classification algorithms SVM, SSRN, A2S2K-ResNet, HyBridSN, SSDGL, RSSGL and LANet, this method significantly improves the classification performance on four representative datasets. 
Experiments have demonstrated that FHDANet can better extract and utilise the spatial and spectral information in hyperspectral images with excellent classification performance under small sample conditions.
- Research Article
3
- 10.1117/1.jrs.16.034504
- Jul 9, 2022
- Journal of Applied Remote Sensing
Hyperspectral image (HSI) classification is a procedure of interest in remote sensing. HSIs contain complex spectral and spatial information, so classification tasks remain difficult. Although current deep-learning models have made significant progress in HSI classification, dealing with spectral and spatial information still requires careful investigation. To better manage spectral and spatial information and improve classification accuracy, we introduce a multiscale residual weakly dense network with an attention mechanism. First, we designed two residual weakly dense (Res-WDens) branches to extract spectral and spatial feature information and then applied the Concat method to fuse the two kinds of information. We also designed a plug-and-play hybrid attention module to refine the fused information so the network could focus on the essential spectral and spatial features. Finally, considering the relevance of spectral and spatial information, a dual-channel multiscale feature extraction module was used to extract the spectral–spatial multiscale information of HSIs. The overall accuracies of our proposed method reached 99.76%, 99.97%, and 100% on three publicly available datasets. A series of experiments demonstrated that our method is comparable to current state-of-the-art methods.
- Research Article
19
- 10.3390/rs15051206
- Feb 22, 2023
- Remote Sensing
Hyperspectral image (HSI) classification is a significant foundation for remote sensing image analysis, widely used in biology, aerospace, and other applications. Convolution neural networks (CNNs) and attention mechanisms have shown outstanding ability in HSI classification and have been widely studied in recent years. However, the existing CNN-based and attention mechanism-based methods cannot fully use spatial–spectral information, which is not conducive to further improving HSI classification accuracy. This paper proposes a new spatial–spectral Transformer network with multi-scale convolution (SS-TMNet), which can effectively extract local and global spatial–spectral information. SS-TMNet includes two key modules, i.e., multi-scale 3D convolution projection module (MSCP) and spatial–spectral attention module (SSAM). The MSCP uses multi-scale 3D convolutions with different depths to extract the fused spatial–spectral features. The spatial–spectral attention module includes three branches: height spatial attention, width spatial attention, and spectral attention, which can extract the fusion information of spatial and spectral features. The proposed SS-TMNet was tested on three widely used HSI datasets: Pavia University, IndianPines, and Houston2013. The experimental results show that the proposed SS-TMNet is superior to the existing methods.
- Research Article
- 10.1080/01431161.2026.2658272
- Apr 17, 2026
- International Journal of Remote Sensing
Hyperspectral image (HSI) classification is based on the principle that the same object exhibits the same spectrum. However, due to the presence of spectral mixing, relying solely on spectral information makes it difficult to achieve accurate classification. Therefore, effectively extracting and integrating spatial-spectral information are crucial for HSI classification. Modelling long-range dependencies among spatial pixels to extract global spatial context is helpful for identifying and understanding land-cover categories and spatial structure distribution in the image. In recent years, the Mamba model has attracted much attention and has been widely applied in HSI classification due to its ability to model long-range dependencies with linear computational complexity. However, it is challenging for a single Mamba model to comprehensively understand spatial and spectral information. Therefore, we propose a novel HSI classification model named NexusMamba, which combines Mamba with the convolutional network to extract spatial and spectral information separately and adaptively integrates spatial-spectral information. Specifically, we design a global spatial Mamba module (GSMM) to model long-range dependencies at the pixel-level for the entire image. Subsequently, we propose a local spectral convolution module (LSCM) to capture local detail information in spectral bands and extract spectral features from a local perspective. Finally, we propose a spatial-spectral adaptive fusion module (SSAFM) to adaptively integrate the spatial and spectral features of HSI. To evaluate the classification performance of NexusMamba, we conducted extensive experiments on three different HSI datasets. The experimental results demonstrate its superior performance in terms of classification accuracy and efficiency. Specifically, NexusMamba achieves OA improvements of 1.96%, 1.46% and 1.78% on the PU, IP and HongHu datasets, respectively. 
This also reveals that Mamba is expected to become the core backbone of next-generation HSI classification models.
- Research Article
84
- 10.1109/tgrs.2022.3185640
- Jan 1, 2022
- IEEE Transactions on Geoscience and Remote Sensing
Convolutional Neural Networks (CNNs) have been extensively applied to hyperspectral (HS) image classification tasks and achieved promising performance. However, for CNN based HS image classification methods, it is hard to depict the dependencies among HS image pixels in long-range distanced positions and bands. Moreover, the limited receptive field of the convolutional layers extremely hinders the development of the CNN structure. To tackle these problems, in this paper, the novel Bottleneck Spatial-Spectral Transformer (BS2T) is proposed to depict the long-range global dependencies of HS image pixels, which can be regarded as a feature extraction module for HS image classification networks. More specifically, inspired by Bottleneck Transformer in computer vision, for HS image feature extraction, the proposed BS2T is incorporated with a feature contraction module, a multi-head spatial-spectral self-attention (MHS2A) module and a feature expansion module. In this way, convolutional operations are replaced by the MHS2A to capture the long-range dependency of HS pixels regardless of their spatial position and distance. Meanwhile, in the MHS2A module, to highlight the spectral features of HS images, we introduce the spectral information and content spatial positional information to classical multi-head self-attentions to make the attentions more positional aware and spectral aware. On this basis, a dual-branch HS image classification framework based on 3D CNN and BS2T is defined for jointly extracting the local-global features of HS images. Experimental results on three public HS image classification datasets show that the proposed classification framework achieves a significant improvement when comparing with the state-of-the-art methods. The source code of the proposed framework can be downloaded from https://github.com/srxlnnu/BS2T.
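The long-range dependency modeling described here rests on self-attention, where every token (for example, one pixel of an HSI patch) attends to every other token regardless of distance. The following is a minimal single-head scaled dot-product attention sketch in NumPy; it does not reproduce the MHS2A module, its positional or spectral terms, or multiple heads, and the projection matrices `wq`, `wk`, `wv` are random placeholders.

```python
import numpy as np

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    tokens: (N, D) array, one row per token.
    wq, wk, wv: (D, d) projection matrices (placeholders here).
    Returns the attended features (N, d) and the attention weights (N, N).
    """
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    # Pairwise similarity between all tokens, scaled by sqrt of the key size.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable row-wise softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(25, 16))            # e.g. a 5x5 patch, 16 features
wq, wk, wv = (rng.normal(size=(16, 8)) for _ in range(3))
features, attn = self_attention(tokens, wq, wk, wv)
```

Because the weight matrix is dense over all token pairs, a pixel in one corner of the patch can influence one in the opposite corner within a single layer, which a small convolution kernel cannot do.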
- Research Article
3
- 10.1109/lgrs.2022.3223090
- Jan 1, 2022
- IEEE Geoscience and Remote Sensing Letters
Compared with traditional images, hyperspectral images (HSI) not only have spatial information, but also have rich spectral information. However, the mainstream hyperspectral image classification (HIC) methods are all based on the Convolutional Neural Network (CNN), which has great advantages in extracting spatial features but has certain limitations in dealing with continuous spectral sequence information. Therefore, the Transformer, which is good at processing sequences, has also been gradually applied to HIC. Besides, since HSIs are typical three-dimensional structures, we believe that the correlation of the three dimensions is also important information. So, in order to fully extract the spectral-spatial information as well as the correlation of the three dimensions, we propose a spectral and spatial feature fusion module (i.e., TransCNN) for HIC. TransCNN consists of CNNs and a Transformer. The former is in charge of mining the spatial and spectral information from different dimensions, while the latter not only undertakes the most critical fusion but also captures the deeper relationship characteristics. We transpose the data to extract features and their correlation through three CNN branches. We believe that these feature maps still contain deep spectral information. Therefore, we embed them into one-dimensional vectors and use the Transformer's Encoder to extract features. However, some information is lost when embedding into one-dimensional vectors. Therefore, we use the Decoder, which has been ignored in the field of vision, to fuse the features before they pass through the Encoder with the features extracted by the Encoder. The two kinds of features are fused by the Decoder, and the obtained information is finally input into the classifier for classification. 
Experimental results on real HSIs show that the proposed architecture can achieve competitive performance compared with the state-of-the-art methods.
- Research Article
41
- 10.1109/tgrs.2023.3281511
- Jan 1, 2023
- IEEE Transactions on Geoscience and Remote Sensing
Hyperspectral image (HSI) classification is currently a hot topic in the field of remote sensing. The goal is to utilize the spectral and spatial information from HSI to accurately identify land covers. Convolution neural network (CNN) is a powerful approach for HSI classification. However, CNN has limited ability to capture non-local information to represent complex features. Recently, vision transformers (ViTs) have gained attention due to their ability to process non-local information. Yet, under the HSI classification scenario with ultra-small sample rates, the spectral-spatial information given to ViTs for global modeling is insufficient, resulting in limited classification capability. Therefore, in this article, Multi-Attention Joint Convolution Feature Representation with Lightweight Transformer (MAR-LWFormer) is proposed, which effectively combines the spectral and spatial features of HSI to achieve efficient classification performance at ultra-small sample rates. Specifically, we use a three-branch network architecture to extract multi-scale convolved 3D-CNN, EMAP, and LBP features of HSI, respectively, by taking full exploitation of ultra-small training samples. Second, we design a series of multi-attention modules to enhance spectral-spatial representation for the three types of features and to improve the coupling and fusion of multiple features. Third, we propose an explicit feature attention tokenizer to transform the feature information, which maximizes the effective spectral-spatial information retained in the flat tokens. Finally, the generated tokens are input to the designed lightweight transformer for encoding and classification. Experimental results on three datasets validate that MAR-LWFormer has an excellent performance in HSI classification at ultra-small sample rates when compared to several state-of-the-art classifiers.
- Research Article
- 10.1016/j.neunet.2025.108512
- May 1, 2026
- Neural networks : the official journal of the International Neural Network Society
CSA-Kansformer: Cross-scale aggregation and Kansformer network for hyperspectral image classification.
- Research Article
69
- 10.1109/lgrs.2021.3117577
- Jan 1, 2022
- IEEE Geoscience and Remote Sensing Letters
Deep learning has achieved great success in hyperspectral image (HSI) classification. However, its success relies on the availability of sufficient training samples. Unfortunately, the collection of training samples is expensive, time-consuming, and even impossible in some cases. Natural image datasets that are different from HSI, such as ImageNet and mini-ImageNet, have abundant texture and structure information. Effective knowledge transfer between two heterogeneous datasets can significantly improve the accuracy of HSI classification. In this letter, heterogeneous few-shot learning (HFSL) for HSI classification is proposed with only a few labeled samples per class. First, few-shot learning is performed on the mini-ImageNet datasets to learn the transferable knowledge. Then, to make full use of the spatial and spectral information, a spectral–spatial fusion network is devised. Spectral information is obtained by the residual network with pure 1-D operators. Spatial information is extracted by a convolution network with pure 2-D operators, and the weights of the spatial network are initialized by the weights of the model trained on the mini-ImageNet datasets. Finally, few-shot learning is fine-tuned on HSI to extract discriminative spectral–spatial features and individual knowledge, which can improve the classification performance of the new classification task. Experiments conducted on two public HSI datasets demonstrate that the HFSL outperforms the existing few-shot learning methods and supervised learning methods for HSI classification with only a few labeled samples. Our source code is available at https://github.com/Li-ZK/HFSL.
- Research Article
17
- 10.3390/rs15071803
- Mar 28, 2023
- Remote Sensing
Hyperspectral images (HSI) contain powerful spectral characterization capabilities and are widely used especially for classification applications. However, the rich spectrum contained in HSI also increases the difficulty of extracting useful information, which makes the feature extraction method significant as it enables effective expression and utilization of the spectrum. Traditional HSI feature extraction methods design spectral features manually, which is likely to be limited by the complex spectral information within HSI. Recently, data-driven methods, especially the use of convolutional neural networks (CNNs), have shown great improvements in performance when processing image data owing to their powerful automatic feature learning and extraction abilities and are also widely used for HSI feature extraction and classification. The CNN extracts features based on the convolution operation. Nevertheless, the local perception of the convolution operation makes CNN focus on the local spectral features (LSF) and weakens the description of features between long-distance spectral ranges, which will be referred to as global spectral features (GSF) in this study. LSF and GSF describe the spectral features from two different perspectives and are both essential for determining the spectrum. Thus, in this study, a local-global spectral feature (LGSF) extraction and optimization method is proposed to jointly consider the LSF and GSF for HSI classification. To increase the relationship between spectra and the possibility to obtain features with more forms, we first transformed the 1D spectral vector into a 2D spectral image. Based on the spectral image, the local spectral feature extraction module (LSFEM) and the global spectral feature extraction module (GSFEM) are proposed to automatically extract the LGSF. The loss function for spectral feature optimization is proposed to optimize the LGSF and obtain improved class separability inspired by contrastive learning. 
We further enhanced the LGSF by introducing spatial relation and designed a CNN constructed using dilated convolution for classification. The proposed method was evaluated on four widely used HSI datasets, and the results highlighted its comprehensive utilization of spectral information as well as its effectiveness in HSI classification.
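The step of turning a 1D spectral vector into a 2D spectral image can be pictured with a simple reshape. The abstract does not say how the transformation is done, so the row-major layout, the near-square shape, and the zero padding below are all assumptions, and `spectrum_to_image` is a hypothetical name.

```python
import math
import numpy as np

def spectrum_to_image(spectrum, width=None):
    """Fold a 1D spectrum of B bands into a 2D array (assumed layout).

    Bands are written row by row; unused trailing cells are zero-padded.
    """
    spectrum = np.asarray(spectrum, dtype=np.float32)
    n = spectrum.shape[0]
    if width is None:
        # Choose a near-square image by default.
        width = math.ceil(math.sqrt(n))
    height = math.ceil(n / width)
    padded = np.zeros(height * width, dtype=np.float32)
    padded[:n] = spectrum
    return padded.reshape(height, width)

image = spectrum_to_image(np.arange(200))   # 200 bands -> 14 x 15 image
```

With this layout, bands that are `width` positions apart in the spectrum become vertical neighbours, so a 2D convolution can relate spectral ranges that a short 1D kernel would never see together.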
- Research Article
183
- 10.1109/tgrs.2019.2951445
- Dec 5, 2019
- IEEE Transactions on Geoscience and Remote Sensing
Deep convolutional neural networks (CNNs) have shown outstanding performance in hyperspectral image (HSI) classification. The success of CNN-based HSI classification relies on the availability of sufficient training samples. However, the collection of training samples is expensive and time consuming. Besides, there are many pretrained models on large-scale data sets, which extract general and discriminative features. The proper reuse of low-level and midlevel representations will significantly improve the HSI classification accuracy. The large-scale ImageNet data set has three channels, but HSI contains hundreds of channels. Therefore, it is difficult to simply adapt the pretrained models for the classification of HSIs. In this article, heterogeneous transfer learning for HSI classification is proposed. First, a mapping layer is used to handle the issue of having different numbers of channels. Then, the model architectures and weights of the CNN trained on the ImageNet data sets are used to initialize the model and weights of the HSI classification network. Finally, a well-designed neural network is used to perform the HSI classification task. Furthermore, an attention mechanism is used to adjust the feature maps due to the difference between the heterogeneous data sets. Moreover, controlled random sampling is used as another training sample selection method to test the effectiveness of the proposed methods. Experimental results on four popular hyperspectral data sets with two training sample selection strategies show that the transferred CNN obtains better classification accuracy than that of state-of-the-art methods. In addition, the idea of heterogeneous transfer learning may open a new window for further research.
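The channel mismatch handled by the mapping layer can be sketched as a per-pixel linear projection, which is what a 1x1 convolution computes. Whether the paper's mapping layer has exactly this form is an assumption; the function name `mapping_layer`, the random weights, and the 103-band input are illustrative only.

```python
import numpy as np

def mapping_layer(cube, weight, bias=None):
    """Project an (H, W, B) hyperspectral cube to (H, W, C_out).

    Equivalent to a 1x1 convolution: each pixel's B-band spectrum is
    multiplied by a (B, C_out) weight matrix. Assumed form of the layer.
    """
    out = np.tensordot(cube, weight, axes=([2], [0]))
    if bias is not None:
        out = out + bias
    return out

rng = np.random.default_rng(0)
cube = rng.random((5, 5, 103))                 # toy patch with 103 bands
weight = rng.normal(size=(103, 3)) * 0.1       # would be learned in practice
rgb_like = mapping_layer(cube, weight)         # shape (5, 5, 3)
```

The three-channel output has the shape expected by a network pretrained on ImageNet, so its pretrained weights can be reused downstream of the projection.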
- Research Article
48
- 10.1109/tgrs.2023.3258488
- Jan 1, 2023
- IEEE Transactions on Geoscience and Remote Sensing
Hyperspectral image (HSI) classification aims to distinguish the category of a land coverage object for each pixel. The transformer architecture has been successfully introduced for the HSI classification task with promising performance. However, existing transformer-based HSI classification methods still suffer from the inability to fully explore both spectral information and spatial information in HSIs. To this end, we propose a Spectral-Spatial Token Enhanced Transformer (SSTE-Former) method with hash-based positional embedding, which is the first to exploit multiscale spectral-spatial information in depth for transformer-based HSI classification. Specifically, SSTE-Former accepts multiscale HSI cubes centered on the target pixel, which are preprocessed by PCA. Then, a designed multiscale CNN architecture is utilized to extract short-range spectral-spatial features and generate token embeddings. In parallel, a novel hash-based spatially enhanced positional embedding tailored for HSI cubes is developed to model the correlations within and across multiscale token embeddings. Finally, multiscale token embeddings and hash-based positional embeddings are concatenated, flattened, and fed into the transformer encoder for long-range spectral-spatial feature fusion. We conduct extensive experiments on four benchmark HSI datasets and achieve superior performance compared with the state-of-the-art HSI classification methods.
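The input preparation described here, PCA along the spectral axis followed by multiscale cubes centered on the target pixel, can be sketched as follows. The window sizes, the reflect padding at image borders, and the function names `pca_reduce` and `extract_cubes` are assumptions for illustration, not details given in the abstract.

```python
import numpy as np

def pca_reduce(image, n_components):
    """Reduce the spectral axis of an (H, W, B) image with PCA (via SVD)."""
    h, w, b = image.shape
    flat = image.reshape(-1, b).astype(np.float64)  # astype copies the data
    flat -= flat.mean(axis=0)                       # centre each band
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(h, w, n_components)

def extract_cubes(image, row, col, sizes=(3, 5, 7)):
    """Cut square cubes of several odd sizes centered on (row, col)."""
    cubes = []
    for s in sizes:
        r = s // 2
        # Reflect-pad so border pixels still get full-size cubes (assumption).
        padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="reflect")
        cubes.append(padded[row:row + s, col:col + s, :])
    return cubes

rng = np.random.default_rng(0)
hsi = rng.random((8, 8, 20))                 # toy image with 20 bands
reduced = pca_reduce(hsi, n_components=4)
cubes = extract_cubes(reduced, row=0, col=0)
```

Each cube keeps the target pixel at its center, so the different scales describe the same location with progressively wider spatial context.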