ConvViTMamba: efficient multiscale convolution, Transformer, and Mamba-based sequence modelling for hyperspectral image classification


ABSTRACT Hyperspectral image (HSI) classification remains a challenging task due to the high spectral dimensionality of the data, strong spectral redundancy, and limited availability of labelled samples. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) have demonstrated strong performance by exploiting spectral-spatial information and long-range dependencies, they often suffer from high computational complexity and large parameter counts, which limit their practical applicability. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient hyperspectral image classification. The proposed architecture integrates three complementary components within a single model: a multiscale convolutional feature extractor for capturing local spectral, spatial, and spectral-spatial patterns; a Vision Transformer-based tokenization and encoding stage for modelling global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient content-aware sequence refinement without relying on quadratic-complexity self-attention. Principal Component Analysis (PCA) is employed as a preprocessing step to reduce spectral redundancy and improve computational efficiency. Extensive experiments are conducted on four benchmark hyperspectral datasets, including Houston and three UAV-borne QUH datasets (Pingan, Qingyun, and Tangdaowan). Quantitative results, evaluated using Overall Accuracy, Average Accuracy, and the Kappa coefficient, demonstrate that ConvVitMamba consistently outperforms state-of-the-art CNN-, Transformer-, and Mamba-based methods while maintaining a favourable balance between classification accuracy, model size, and inference efficiency. Ablation studies further confirm the complementary contributions of the multiscale convolutional, transformer, and Mamba-inspired components. 
These results indicate that the proposed framework provides an effective and efficient solution for hyperspectral image classification under both urban and natural scene settings. The source code is publicly available at https://github.com/mqalkhatib/ConvVitMamba
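The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not taken from the authors' repository: the function name `classification_metrics` and the toy labels are hypothetical, and the standard definitions are assumed (OA as the fraction of correctly labelled samples, AA as the mean of per-class recalls, and Cohen's kappa as agreement corrected for chance).

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (OA, AA, kappa) for two equal-length label sequences.

    Hypothetical helper for illustration; standard metric definitions assumed.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(y_true)
    true_counts = Counter(y_true)   # samples per ground-truth class
    pred_counts = Counter(y_pred)   # samples per predicted class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Overall Accuracy: correctly classified samples over all samples.
    oa = sum(correct.values()) / n

    # Average Accuracy: mean per-class recall over the ground-truth classes.
    aa = sum(correct[c] / true_counts[c] for c in true_counts) / len(true_counts)

    # Expected chance agreement from the row and column marginals.
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    # Kappa is undefined when pe == 1 (a single class everywhere);
    # returning 0.0 in that case is a choice made for this sketch.
    kappa = (oa - pe) / (1 - pe) if pe != 1 else 0.0
    return oa, aa, kappa


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 0, 1, 2, 2, 2]
    oa, aa, kappa = classification_metrics(y_true, y_pred)
    print(f"OA={oa:.4f} AA={aa:.4f} kappa={kappa:.4f}")
```

AA weights every class equally, so it can differ noticeably from OA when class sizes are imbalanced, which is common in hyperspectral benchmarks.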

Similar Papers
  • Research Article
  • Citations: 84
  • 10.1109/tgrs.2022.3185640
BS2T: Bottleneck Spatial–Spectral Transformer for Hyperspectral Image Classification
  • Jan 1, 2022
  • IEEE Transactions on Geoscience and Remote Sensing
  • Ruoxi Song + 4 more

Convolutional Neural Networks (CNNs) have been extensively applied to hyperspectral (HS) image classification tasks and achieved promising performance. However, for CNN based HS image classification methods, it is hard to depict the dependencies among HS image pixels in long-range distanced positions and bands. Moreover, the limited receptive field of the convolutional layers extremely hinders the development of the CNN structure. To tackle these problems, in this paper, the novel Bottleneck Spatial-Spectral Transformer (BS2T) is proposed to depict the long-range global dependencies of HS image pixels, which can be regarded as a feature extraction module for HS image classification networks. More specifically, inspired by Bottleneck Transformer in computer vision, for HS image feature extraction, the proposed BS2T is incorporated with a feature contraction module, a multi-head spatial-spectral self-attention (MHS2A) module and a feature expansion module. In this way, convolutional operations are replaced by the MHS2A to capture the long-range dependency of HS pixels regardless of their spatial position and distance. Meanwhile, in the MHS2A module, to highlight the spectral features of HS images, we introduce the spectral information and content spatial positional information to classical multi-head self-attentions to make the attentions more positional aware and spectral aware. On this basis, a dual-branch HS image classification framework based on 3D CNN and BS2T is defined for jointly extracting the local-global features of HS images. Experimental results on three public HS image classification datasets show that the proposed classification framework achieves a significant improvement when comparing with the state-of-the-art methods. The source code of the proposed framework can be downloaded from https://github.com/srxlnnu/BS2T.

  • Conference Article
  • Citations: 1
  • 10.1145/3641584.3641609
Hyperspectral Image Classification Using 3D Attention Mechanism in Collaboration with Transformer
  • Sep 22, 2023
  • Yubing Wang + 2 more

With the continuous innovation in deep learning, it has become a major direction for scholars to introduce the knowledge of deep learning into hyperspectral image classification to enhance its classification accuracy. Convolutional Neural Networks (CNN) are one of the most commonly used deep learning-based visual data processing methods, and are widely used in hyperspectral image (HSI) classification by virtue of their excellent contextual modeling capability. Since the performance of HSI classification is highly dependent on spatial and spectral information, this paper proposes a hyperspectral image classification method using 3D attention mechanism in collaboration with Transformer for hyperspectral image classification in view of the problems that the current hyperspectral image classification models with the framework of CNN have insufficient spatial spectral feature extraction and fail to excavate and represent the sequence properties of spectral features well. In this paper, we introduce a variant Transformer model based on a hybrid model of both improved 3D-CNN and 2D-CNN, combining complementary information of spatial spectrum and spectra in the form of 3D convolution and 2D convolution on CNN, and adding a variant attention mechanism module to strengthen spatial texture features, while combining grouped transfer Transformer to jump connection to enable the lower layer to better learn the upper layer features. Firstly, a variant channel attention mechanism is introduced on 3D-CNN to enhance the acquisition of spectral information of image features by 3D-CNN. Secondly, a variant spatial attention mechanism is introduced to enable 3D-CNN to better acquire the spatial information of hyperspectral images in the network, and subsequently the acquired spatial and spectral feature information is passed to 2D-CNN to enable it to better acquire local feature information. 
Finally, the acquired image feature information is passed to the variant Transformer model to make up for the fact that CNN can only acquire hyperspectral image features in local contexts, enabling it to better acquire global feature information on feature sequences. The proposed model is evaluated on two hyperspectral datasets, Indian Pines and Pavia University. On the PU dataset, the overall classification accuracy (OA), average classification accuracy (AA), and Kappa coefficient reach up to 99.59%, 99.31%, and 99.45%, respectively, an improvement in classification accuracy over current cutting-edge techniques.

  • Research Article
  • Citations: 32
  • 10.32604/cmes.2022.020601
Advances in Hyperspectral Image Classification Based on Convolutional Neural Networks: A Review
  • Jan 1, 2022
  • Computer Modeling in Engineering & Sciences
  • Somenath Bera + 2 more

Hyperspectral image (HSI) classification has been one of the most important tasks in the remote sensing community over the last few decades. Due to the presence of highly correlated bands and limited training samples in HSI, discriminative feature extraction was challenging for traditional machine learning methods. Recently, deep learning based methods have been recognized as a powerful feature extraction tool and have drawn a significant amount of attention in HSI classification. Among various deep learning models, convolutional neural networks (CNNs) have shown huge success and offered great potential to yield high performance in HSI classification. Motivated by this successful performance, this paper presents a systematic review of different CNN architectures for HSI classification and provides some future guidelines. To accomplish this, our study has taken a few important steps. First, we have focused on different CNN architectures, which are able to extract spectral, spatial, and joint spectral-spatial features. Then, many publications related to CNN based HSI classification have been reviewed systematically. Further, a detailed comparative performance analysis has been presented between four CNN models, namely 1D CNN, 2D CNN, 3D CNN, and feature fusion based CNN (FFCNN). Four benchmark HSI datasets have been used in our experiment for evaluating the performance. Finally, we concluded the paper with challenges in CNN based HSI classification and future guidelines that may help researchers working on HSI classification using CNN.

  • Research Article
  • Citations: 147
  • 10.3390/s18093153
Hyperspectral Image Classification with Capsule Network Using Limited Training Samples.
  • Sep 18, 2018
  • Sensors
  • Fei Deng + 5 more

Deep learning techniques have boosted the performance of hyperspectral image (HSI) classification. In particular, convolutional neural networks (CNNs) have shown superior performance to that of the conventional machine learning algorithms. Recently, a novel type of neural networks called capsule networks (CapsNets) was presented to improve the most advanced CNNs. In this paper, we present a modified two-layer CapsNet with limited training samples for HSI classification, which is inspired by the comparability and simplicity of the shallower deep learning models. The presented CapsNet is trained using two real HSI datasets, i.e., the PaviaU (PU) and SalinasA datasets, representing complex and simple datasets, respectively, and which are used to investigate the robustness or representation of every model or classifier. In addition, a comparable paradigm of network architecture design has been proposed for the comparison of CNN and CapsNet. Experiments demonstrate that CapsNet shows better accuracy and convergence behavior for the complex data than the state-of-the-art CNN. For CapsNet using the PU dataset, the Kappa coefficient, overall accuracy, and average accuracy are 0.9456, 95.90%, and 96.27%, respectively, compared to the corresponding values yielded by CNN of 0.9345, 95.11%, and 95.63%. Moreover, we observed that CapsNet has much higher confidence for the predicted probabilities. Subsequently, this finding was analyzed and discussed with probability maps and uncertainty analysis. In terms of the existing literature, CapsNet provides promising results and explicit merits in comparison with CNN and two baseline classifiers, i.e., random forests (RFs) and support vector machines (SVMs).

  • Research Article
  • Citations: 1
  • 10.3390/photonics12020146
3DVT: Hyperspectral Image Classification Using 3D Dilated Convolution and Mean Transformer
  • Feb 11, 2025
  • Photonics
  • Xinling Su + 1 more

Hyperspectral imaging and laser technology both rely on different wavelengths of light to analyze the characteristics of materials, revealing their composition, state, or structure through precise spectral data. In hyperspectral image (HSI) classification tasks, the limited number of labeled samples and the lack of feature extraction diversity often lead to suboptimal classification performance. Furthermore, traditional convolutional neural networks (CNNs) primarily focus on local features in hyperspectral data, neglecting long-range dependencies and global context. To address these challenges, this paper proposes a novel model that combines CNNs with an average pooling Vision Transformer (ViT) for hyperspectral image classification. The model utilizes three-dimensional dilated convolution and two-dimensional convolution to extract multi-scale spatial–spectral features, while ViT was employed to capture global features and long-range dependencies in the hyperspectral data. Unlike the traditional ViT encoder, which uses linear projection, our model replaces it with average pooling projection. This change enhances the extraction of local features and compensates for the ViT encoder’s limitations in local feature extraction. This hybrid approach effectively combines the local feature extraction strengths of CNNs with the long-range dependency handling capabilities of Transformers, significantly improving overall performance in hyperspectral image classification tasks. Additionally, the proposed method holds promise for the classification of fiber laser spectra, where high precision and spectral analysis are crucial for distinguishing between different fiber laser characteristics. Experimental results demonstrate that the CNN-Transformer model substantially improves classification accuracy on three benchmark hyperspectral datasets. The overall accuracies achieved on the three public datasets—IP, PU, and SV—were 99.35%, 99.31%, and 99.66%, respectively. 
These advancements offer potential benefits for a wide range of applications, including high-performance optical fiber sensing, laser medicine, and environmental monitoring, where accurate spectral classification is essential for the development of advanced systems in fields such as laser medicine and optical fiber technology.

  • Research Article
  • Citations: 36
  • 10.1109/tgrs.2022.3180685
Grafting Transformer on Automatically Designed Convolutional Neural Network for Hyperspectral Image Classification
  • Jan 1, 2022
  • IEEE Transactions on Geoscience and Remote Sensing
  • Xizhe Xue + 4 more

Hyperspectral image (HSI) classification has been a hot topic for decades, as hyperspectral images have rich spatial and spectral information and provide a strong basis for distinguishing different land-cover objects. Benefiting from the development of deep learning technologies, deep learning based HSI classification methods have achieved promising performance. Recently, several neural architecture search (NAS) algorithms have been proposed for HSI classification, which further improve the accuracy of HSI classification to a new level. In this paper, NAS and Transformer are combined for handling the HSI classification task for the first time. Compared with previous work, the proposed method has two main differences. First, we revisit the search spaces designed in previous HSI classification NAS methods and propose a novel hybrid search space, consisting of the space dominated cell and the spectrum dominated cell. Compared with search spaces proposed in previous works, the proposed hybrid search space is more aligned with the characteristics of HSI data, that is, HSIs have a relatively low spatial resolution and an extremely high spectral resolution. Second, to further improve the classification accuracy, we attempt to graft the emerging transformer module on the automatically designed convolutional neural network (CNN) to add global information to the local region focused features learned by the CNN. Experimental results on three public HSI datasets show that the proposed method achieves much better performance than comparison approaches, including manually designed networks and NAS based HSI classification methods. In particular, on the most recently captured dataset, Houston University, overall accuracy is improved by nearly 6 percentage points. Code is available at: https://github.com/Cecilia-xue/HyT-NAS.

  • Conference Article
  • Citations: 11
  • 10.1109/igarss39084.2020.9323727
Hyperspectral Image Classification Using Fisher's Linear Discriminant Analysis Feature Reduction with Gabor Filtering and CNN
  • Sep 26, 2020
  • Meilun Zhou + 3 more

Deep learning-based approaches for hyperspectral image (HSI) feature extraction and classification have gained popularity in recent years. Effective extraction of spectral and spatial information is desired for classifying HSI using a convolutional neural network (CNN) to avoid overfitting. Previous research suggests that Fisher's linear discriminant analysis (LDA) is a better alternative for HSI feature reduction compared to principal component analysis (PCA). In this work, an LDA approach is studied as a dimensionality reducer along with a Gabor filter for extracting spatial features and classification using CNN. The efficacy of the proposed approach is compared with a similar classification scheme with the PCA. Experimental results from two benchmark HSI datasets show the benefits of using LDA with notable improvements in class and overall accuracies.

  • Research Article
  • Citations: 183
  • 10.1109/tgrs.2019.2951445
Heterogeneous Transfer Learning for Hyperspectral Image Classification Based on Convolutional Neural Network
  • Dec 5, 2019
  • IEEE Transactions on Geoscience and Remote Sensing
  • Xin He + 2 more

Deep convolutional neural networks (CNNs) have shown outstanding performance in hyperspectral image (HSI) classification. The success of CNN-based HSI classification relies on the availability of sufficient training samples. However, the collection of training samples is expensive and time consuming. Besides, there are many models pretrained on large-scale data sets, which extract general and discriminative features. The proper reuse of low-level and midlevel representations will significantly improve the HSI classification accuracy. The large-scale ImageNet data set has three channels, but HSI contains hundreds of channels. Therefore, it is difficult to simply adapt the pretrained models for the classification of HSIs. In this article, heterogeneous transfer learning for HSI classification is proposed. First, a mapping layer is used to handle the issue of having different numbers of channels. Then, the model architectures and weights of the CNN trained on the ImageNet data sets are used to initialize the model and weights of the HSI classification network. Finally, a well-designed neural network is used to perform the HSI classification task. Furthermore, an attention mechanism is used to adjust the feature maps due to the difference between the heterogeneous data sets. Moreover, controlled random sampling is used as another training sample selection method to test the effectiveness of the proposed methods. Experimental results on four popular hyperspectral data sets with two training sample selection strategies show that the transferred CNN obtains better classification accuracy than that of state-of-the-art methods. In addition, the idea of heterogeneous transfer learning may open a new window for further research.

  • Research Article
  • Citations: 19
  • 10.3390/rs15051206
SS-TMNet: Spatial–Spectral Transformer Network with Multi-Scale Convolution for Hyperspectral Image Classification
  • Feb 22, 2023
  • Remote Sensing
  • Xiaohui Huang + 4 more

Hyperspectral image (HSI) classification is a significant foundation for remote sensing image analysis, widely used in biology, aerospace, and other applications. Convolution neural networks (CNNs) and attention mechanisms have shown outstanding ability in HSI classification and have been widely studied in recent years. However, the existing CNN-based and attention mechanism-based methods cannot fully use spatial–spectral information, which is not conducive to further improving HSI classification accuracy. This paper proposes a new spatial–spectral Transformer network with multi-scale convolution (SS-TMNet), which can effectively extract local and global spatial–spectral information. SS-TMNet includes two key modules, i.e., multi-scale 3D convolution projection module (MSCP) and spatial–spectral attention module (SSAM). The MSCP uses multi-scale 3D convolutions with different depths to extract the fused spatial–spectral features. The spatial–spectral attention module includes three branches: height spatial attention, width spatial attention, and spectral attention, which can extract the fusion information of spatial and spectral features. The proposed SS-TMNet was tested on three widely used HSI datasets: Pavia University, IndianPines, and Houston2013. The experimental results show that the proposed SS-TMNet is superior to the existing methods.

  • Research Article
  • 10.1080/01431161.2025.2457130
Bridging branches and attributes: spectral-spatial global-local interaction network for hyperspectral image classification
  • Feb 7, 2025
  • International Journal of Remote Sensing
  • Leiquan Wang + 7 more

The CNN-Transformer joint model stands as the leading architecture for contemporary hyperspectral image (HSI) classification, integrating global and local features through either successive or dual-branch CNN and Transformer networks. However, these methods often fall short in effectively incorporating spatial-spectral information with local-global attributes, resulting in incomplete feature representation. To address these challenges, we propose a spectral-spatial global-local interaction network that transmits global and local features into the spatial and spectral branches, facilitated by cross-interaction operators to ensure adequate feature flow. Initially, CNNs are employed to separately extract shallow features for the spectral and spatial branches. We then introduce a Spectral-Spatial Global-Local Interaction block designed for deep feature extraction, enhancing the flow of spectral and spatial features with local and global attributes using parallel transformers and dynamic convolutions. Transformers model the long-range dependencies of global spectral and spatial features, while dynamic convolutions enhance the context sensitivity of local spectral and spatial representations. Quadruple cross-interaction blocks are proposed to traverse both the spectral-spatial branches and local-global attribute dimensions, facilitating information exchange for complementary HSI representation. Extensive experiments and ablation studies on four public HSI datasets demonstrate the superiority of our proposed method.
Keywords: convolutional neural networks (CNNs); spectral and spatial; global and local; hyperspectral image classification; cross-interaction.

  • Research Article
  • Citations: 30
  • 10.1016/j.engappai.2024.108669
MSTSENet: Multiscale Spectral–Spatial Transformer with Squeeze and Excitation network for hyperspectral image classification
  • May 30, 2024
  • Engineering Applications of Artificial Intelligence
  • Irfan Ahmad + 4 more

  • Conference Article
  • Citations: 1
  • 10.1109/icicsp55539.2022.10050698
Lightweight Multilevel Feature Fusion Network for Hyperspectral Image Classification
  • Nov 26, 2022
  • Quanyu Huang + 3 more

Hyperspectral image (HSI) classification is the key technology of remote sensing image processing. In recent years, convolutional neural network (CNN), which is a powerful feature extractor, has been introduced into the field of HSI classification. Since the features of HSI are the basis of HSI classification, how to effectively extract the spectral-spatial features from HSI with CNN has become a research hotspot. The HSI feature extraction network, based on two-dimensional (2D) and three-dimensional (3D) CNN which can extract both spectral and spatial information, may lead to the increase of parameters and computational cost. Compared with 2D CNN and 3D CNN, the number of parameters and computational cost of one-dimensional (1D) CNN will be greatly reduced. However, 1D CNN based algorithms can only extract the spectral information without considering the spatial information. Therefore, in this paper, a lightweight multilevel feature fusion network (LMFFN) is proposed for HSI classification, which aims to achieve efficient extraction of spectral-spatial features and to minimize the number of parameters. The main contributions of this paper are divided into the following two points: First, we design a hybrid spectral-spatial feature extraction network (HSSFEN) to combine the advantages of 1D, 2D and 3D CNN. It introduces the idea of depthwise separable convolution method, which effectively reduces the complexity of the proposed HSSFEN. Then, a multilevel spectral-spatial feature fusion network (MSSFFN) is proposed to further obtain more effective spectral-spatial features, which effectively fuses the bottom spectral-spatial features and the top spectral-spatial features. To demonstrate the performance of our proposed method, a series of experiments are conducted on three HSI datasets, including Indian Pine, University of Pavia, and Salinas Scene datasets. 
The experimental results indicate that our proposed LMFFN is able to achieve better performance than the manual feature extraction methods and deep learning methods, which demonstrates the superiority of our proposed method.
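The complexity reduction attributed to depthwise separable convolution can be made concrete with a parameter count. The sketch below assumes the generic 2D case without bias terms; the function names and the layer sizes (3×3 kernel, 64 input channels, 128 output channels) are hypothetical and are not the configuration of LMFFN.

```python
def conv2d_params(k, c_in, c_out):
    """Weights in a standard k x k 2D convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Layer sizes invented for illustration only.
    k, c_in, c_out = 3, 64, 128
    standard = conv2d_params(k, c_in, c_out)
    separable = depthwise_separable_params(k, c_in, c_out)
    print(standard, separable, separable / standard)
```

For these sizes the counts are 73,728 and 8,768 weights, a ratio of about 0.119, which matches the general expression 1/c_out + 1/k².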

  • Research Article
  • Citations: 12
  • 10.1049/ipr2.12632
Hybrid network model based on 3D convolutional neural network and scalable graph convolutional network for hyperspectral image classification
  • Sep 25, 2022
  • IET Image Processing
  • Xili Wang + 1 more

Hyperspectral images (HSIs) contain hundreds of continuous spectral bands and are rich in spectral‐spatial information. In terms of HSIs’ classification, traditional convolutional neural networks (CNNs) extract features based on HSI's spectral‐spatial information through 2D convolution. However, 2D convolution extracts features in 2D plane without considering the relationships between spectral bands, which inevitably leads to insufficient feature extraction. 3D convolutional neural networks (3DCNNs) take account of the correlations among spectral bands and outperform 2D convolutional networks in feature extraction, but the computational cost is rather expensive. To address the above problem, a light‐weight three‐layer 3D convolutional network Module (3D‐M) for HSIs’ spectral‐spatial feature extraction is proposed. Another challenge is that neither 2D convolution nor 3D convolution utilizes the structural information inherent in the data. Graph convolution networks (GCNs) can model and utilize such information through the similarity matrix, also known as adjacency matrix. However, traditional GCNs cannot handle large‐scale data because they construct adjacency matrix on all data, which results in high computational complexity and large storage requirement. To conquer this challenge, this article proposes a batch‐graph strategy on which a scalable GCN is developed. Finally, a hybrid network model (HNM) based on the proposed light‐weight 3D‐M and scalable GCN is presented. HNM extracts spectral‐spatial features of HSIs with low computational complexity through the light‐weight 3D convolution network and leverages the structural information in data via the scalable GCN. The experimental results on three public datasets with different sizes demonstrate that the proposed HNM produces better classification results than other state‐of‐the‐art hyperspectral images classification models in terms of overall accuracy (OA), average accuracy (AA) and kappa coefficient (Kappa).

  • Research Article
  • Citations: 30
  • 10.1080/2150704x.2019.1569274
Hyperspectral images classification with convolutional neural network and textural feature using limited training samples
  • Feb 1, 2019
  • Remote Sensing Letters
  • Wudi Zhao + 4 more

ABSTRACT In this letter, a new deep learning framework, which integrates textural features of the gray level co-occurrence matrix (GLCM) into convolutional neural networks (CNNs), is proposed for hyperspectral image (HSI) classification using a limited number of labeled samples. The proposed method can be implemented in three steps. Firstly, the GLCM textural features are extracted from the first principal component after the principal component analysis (PCA) transformation. Secondly, a CNN is built to extract the deep spectral features from the original HSIs, and the features are concatenated with the textural features obtained in the first step in a concat layer of the CNN. Finally, softmax is employed to generate classification maps at the end of the framework. In this way, the CNN focuses on the learning of spectral features only, and the generated textural features are used directly as one set of features before softmax. These lead to a reduction in the required number of training samples and an improvement in computing efficiency. The experimental results are presented for three HSIs and compared with several advanced deep learning and spectral-spatial classification techniques. Competitive classification accuracy can be obtained, especially when only a limited number of training samples are available.

  • Research Article
  • Citations: 28
  • 10.1016/j.sigpro.2024.109669
State space models meet transformers for hyperspectral image classification
  • Aug 22, 2024
  • Signal Processing
  • Xuefei Shi + 6 more
