ABSTRACT Although multimodal remote sensing data processing has attracted growing interest in the geoscience field, effectively extracting, representing, and fusing heterogeneous features remains challenging. This article proposes a novel heterogeneous feature learning network (HFLN) for collaborative classification of multimodal remote sensing data, which combines the long-range feature modelling capability of the transformer with the local feature extraction of the convolutional neural network (CNN). First, global spectral dependencies are extracted from hyperspectral images using a spectral transformer structure. Then, local, spatially invariant characteristics are extracted from multimodal remote sensing images by convolution. Next, the heterogeneous spectral and spatial characteristics, which differ significantly in structure and semantics, are dynamically and interactively integrated through a feature coupling module. Finally, a multi-stage network architecture extracts hierarchical characteristics: the number of feature maps in the CNN and the number of heads in the transformer are gradually increased to represent more complex features. Experiments on four benchmark remote sensing datasets (hyperspectral images and multimodal remote sensing datasets) demonstrate the effectiveness of the proposed approach.
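As an illustration only, the contrast between global spectral modelling, local spatial extraction, and the growing multi-stage widths can be sketched in a few lines of dependency-free Python. This is not the HFLN implementation: the function names, the scalar per-band tokens, the fixed 3x3 averaging kernel, and the doubling schedule are all assumptions chosen for brevity, and the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spectral_self_attention(spectrum):
    """Toy single-head self-attention over the bands of one pixel.

    Assumption: each band is a scalar token with identity projections,
    so every band attends to every other band (global dependencies).
    """
    attended = []
    for query in spectrum:
        weights = softmax([query * key for key in spectrum])
        attended.append(sum(w * v for w, v in zip(weights, spectrum)))
    return attended


def local_mean_conv3x3(image):
    """Toy 3x3 convolution (averaging kernel, no padding).

    Each output value depends only on a 3x3 neighbourhood (local features).
    """
    height, width = len(image), len(image[0])
    return [
        [
            sum(image[i + di][j + dj] for di in range(3) for dj in range(3)) / 9.0
            for j in range(width - 2)
        ]
        for i in range(height - 2)
    ]


def stage_schedule(num_stages, base_channels, base_heads):
    """Hypothetical schedule: CNN feature maps and transformer heads
    both double at each stage (the doubling factor is an assumption)."""
    return [
        {
            "stage": s + 1,
            "channels": base_channels * 2 ** s,
            "heads": base_heads * 2 ** s,
        }
        for s in range(num_stages)
    ]
```

In this toy setting, `spectral_self_attention` returns, for each band, a weighted average over all bands, whereas each output of `local_mean_conv3x3` sees only its immediate spatial neighbourhood; `stage_schedule(3, 16, 1)` yields stage widths of 16, 32, and 64 feature maps with 1, 2, and 4 heads.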