Abstract

Multimodal feature fusion, e.g., hyperspectral image and light detection and ranging (HSI-LiDAR) fusion, is an essential topic in fusion perception. However, existing networks tend to rely on rigid feature stacking or local context fusion between modalities, ignoring the power of globally mutual-guided feature transmission. This paper therefore develops a mutually beneficial transformer for multimodal data fusion (MBFormer), which proceeds in the following steps. First, a spatial constraint-based self-attention (SCS) module applies spectralwise attention to the HSI data and a spatialwise convolution to the LiDAR data; a spatial guide mask generated from the LiDAR elevation information then acts as an agent that constrains the spatial features of the HSI branch. Second, a channel diversity-based transformer (CDT) module builds on local spectral embeddings and applies an adaptive token-mixer mechanism to the groupwise classification tokens of the HSI data and the individual LiDAR data, enabling global information connectivity and transitivity. Finally, the fused features are fed into a classification layer to produce the final result. Experimental results show that the proposed MBFormer achieves 97.76% and 98.62% classification accuracy on the Houston and Trento datasets, respectively, demonstrating its advantages and competitiveness over the compared state-of-the-art methods.
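
To make the spatial-constraint idea in the SCS module concrete, the following is a minimal PyTorch sketch of how a LiDAR-derived spatial guide mask could gate the spatial features of an HSI branch after spectralwise reweighting. The class name SpatialGuideMask, the layer sizes, and the sigmoid gating form are illustrative assumptions for exposition, not the paper's exact design.

```python
import torch
import torch.nn as nn


class SpatialGuideMask(nn.Module):
    """Sketch of a spatial guide mask: LiDAR elevation gates HSI spatial features.

    Assumed components: spectralwise (channel) attention on the HSI branch and a
    spatialwise convolution on the LiDAR branch, combined by elementwise masking.
    """

    def __init__(self, hsi_channels: int, hidden: int = 16):
        super().__init__()
        # Spatialwise convolution on the single-band LiDAR elevation map
        self.lidar_conv = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
        )
        # Spectralwise attention: per-channel reweighting of HSI features
        self.spectral_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(hsi_channels, hsi_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, hsi_feat: torch.Tensor, lidar: torch.Tensor) -> torch.Tensor:
        # hsi_feat: (B, C, H, W) HSI branch features; lidar: (B, 1, H, W) elevation
        hsi_feat = hsi_feat * self.spectral_attn(hsi_feat)   # spectralwise reweighting
        mask = torch.sigmoid(self.lidar_conv(lidar))          # spatial guide mask in [0, 1]
        return hsi_feat * mask                                # constrain HSI spatial features


if __name__ == "__main__":
    hsi = torch.randn(2, 64, 11, 11)    # toy HSI patch features
    lidar = torch.randn(2, 1, 11, 11)   # toy LiDAR elevation patch
    print(SpatialGuideMask(64)(hsi, lidar).shape)  # torch.Size([2, 64, 11, 11])
```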
