Multimodal Transformer Network for Hyperspectral and LiDAR Classification

Yiyan Zhang, Chenkai Zhang, Chenming Li, Shufang Xu, Hongmin Gao, Danfeng Hong, Meiqiao Bi

doi:10.1109/tgrs.2023.3283508

Abstract

The land cover classification of single-modal remote sensing (RS) data has recently reached a performance bottleneck, and the joint use of multi-modal RS data to improve classification performance has therefore received much attention. Convolutional neural networks (CNNs) are powerful tools for feature extraction and contextual modeling. However, owing to limitations inherent in their network backbones, they have difficulty capturing the sequence attributes of spectral signatures and struggle to acquire discriminative spectral-spatial features from a global perspective. The transformer backbone is a promising approach for addressing these challenges and generating novel insights in multi-modal RS image classification. In this article, we present a new model called Multi-modal Transformer Network (MTNet) that leverages the advantages of transformers to capture both the specific and the shared characteristics of hyperspectral (HS) and light detection and ranging (LiDAR) data. HS images contain a wide range of bands with rich spectral information, and LiDAR data provide accurate elevation information without being affected by environmental factors. The proposed Hyperspectral Spectral Transformer module learns spectrally local sequence information from neighbouring bands of HS images, yielding group-wise spectral embeddings that carry rich diagnostic information about land covers. Furthermore, the HS and LiDAR spatial transformers mine the relationships among pixel-wise feature embeddings in a global manner, capturing the spatial information of HS and the elevation information of LiDAR, respectively. Finally, the feature embedding tokens of the two modalities are integrated, and a redesigned transformer encoder explores the spatial characteristics shared between them. We evaluate the classification performance of the proposed MTNet through extensive experiments on three public HS-LiDAR datasets, where it outperforms conventional classifiers and state-of-the-art networks.
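The abstract names two token-level mechanisms: group-wise spectral embeddings built from neighbouring HS bands, and a shared transformer encoder applied to the joined HS and LiDAR tokens. The NumPy sketch below illustrates both ideas under stated assumptions; it is not the authors' implementation. The window size, the zero padding at the spectral edges, the random (untrained) projection matrices, the single attention head, and all token counts and dimensions are hypothetical choices made only for illustration.

```python
import numpy as np

def group_spectral_tokens(spectrum, group_size, weight):
    """Build one token per band from a window of neighbouring bands.

    spectrum:   (B,) spectral vector of one pixel.
    group_size: odd window width (hypothetical choice).
    weight:     (group_size, d) linear projection (hypothetical; would be learned).
    Returns:    (B, d) group-wise spectral embeddings.
    """
    half = group_size // 2
    # Zero-pad both ends so every band, including the edge bands, has a full window.
    padded = np.pad(spectrum, half)
    # Row i holds bands i-half .. i+half (centred on band i).
    windows = np.stack([padded[i:i + group_size] for i in range(spectrum.shape[0])])
    return windows @ weight

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability before exp
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ v

rng = np.random.default_rng(0)
num_bands, d = 20, 16  # hypothetical sizes

# Spectral branch: group-wise tokens for one HS pixel, then global attention over bands.
spectrum = rng.random(num_bands)
spec_tokens = group_spectral_tokens(spectrum, 3, rng.standard_normal((3, d)))
spec_out = self_attention(spec_tokens, *(rng.standard_normal((d, d)) for _ in range(3)))

# Fusion: concatenate pixel-wise tokens of both modalities (a hypothetical 7x7 patch
# each) so one shared attention layer can relate HS and LiDAR positions jointly.
hs_patch_tokens = rng.standard_normal((49, d))
lidar_patch_tokens = rng.standard_normal((49, d))
fused = np.concatenate([hs_patch_tokens, lidar_patch_tokens], axis=0)  # (98, d)
shared = self_attention(fused, *(rng.standard_normal((d, d)) for _ in range(3)))

print(spec_out.shape, shared.shape)  # (20, 16) (98, 16)
```

Concatenating the two token sequences before attention is one simple way to let every HS token attend to every LiDAR token and vice versa; the actual fusion and encoder design in MTNet may differ.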

Full Text