Joint classification of hyperspectral images (HSI) with other modalities can significantly enhance interpretation, particularly when elevation information from a LiDAR sensor is integrated. Recently, the transformer architecture was introduced to the HSI and LiDAR classification task and has proven highly effective. However, existing naive transformer architectures suffer from two main drawbacks: 1) inadequate simultaneous extraction of local spatial information and multi-scale information from HSI; 2) the matrix computations in the transformer consume vast amounts of computing power. In this paper, we propose a novel Stochastic Window Transformer (SWFormer) framework to resolve these issues. First, effective spatial and spectral feature projection networks are built independently, based on the composition of the hybrid-modal heterogeneous data, using parallel feature extraction, which helps extract more representative perceptual features along different dimensions. Furthermore, to construct local-global nonlinear feature maps more flexibly, we couple multi-scale strip convolution with a transformer strategy. Moreover, in an innovative random window transformer structure, features are randomly masked to achieve sparse window pruning, alleviating redundancy in information density and reducing the parameters required for dense attention. Finally, we design a plug-and-play feature aggregation module that adaptively compensates for the domain offset between modal features to minimize the semantic gap between them and enhance the representational ability of the fused feature. Experiments on three benchmark datasets demonstrate the effectiveness of SWFormer for classification.
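The random window masking idea can be illustrated with a minimal sketch. This is not the SWFormer implementation, which the abstract does not specify. It assumes non-overlapping square windows, a uniform random choice of which windows to keep, and a feature map whose height and width are divisible by the window size. The function names (`window_grid`, `sample_windows`, `attention_pairs`) and the `keep_ratio` parameter are hypothetical, introduced here only for illustration. The sketch counts query-key pairs to show how pruning windows shrinks the attention workload.

```python
import random


def window_grid(height, width, window):
    """Return the top-left corner of every non-overlapping window.

    Assumes height and width are divisible by window (illustrative only).
    """
    return [(r, c)
            for r in range(0, height, window)
            for c in range(0, width, window)]


def sample_windows(height, width, window, keep_ratio, seed=0):
    """Randomly keep a fraction of windows; the rest are masked (pruned)."""
    corners = window_grid(height, width, window)
    n_keep = max(1, round(keep_ratio * len(corners)))
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return sorted(rng.sample(corners, n_keep))


def attention_pairs(height, width, window, kept):
    """Count query-key pairs for global, windowed, and pruned attention."""
    tokens = height * width
    per_window = (window * window) ** 2  # pairs inside one window
    n_windows = len(window_grid(height, width, window))
    return {
        "global": tokens ** 2,
        "all_windows": n_windows * per_window,
        "kept_windows": len(kept) * per_window,
    }


# An 8x8 token map split into four 4x4 windows, half of which are kept.
kept = sample_windows(height=8, width=8, window=4, keep_ratio=0.5)
costs = attention_pairs(8, 8, 4, kept)
print(kept)
print(costs)
```

Under these assumptions, global attention over the 64 tokens involves 4096 query-key pairs, attention restricted to all four windows involves 1024, and keeping two of the four windows halves that to 512. Which windows are kept depends on the seed.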