Abstract

The recent success of attention-driven deep models, with the Vision Transformer (ViT) as one of the most representative, has inspired a wave of research exploring their adaptation to broader domains. However, current Transformer-based approaches in the remote sensing (RS) community focus mainly on single-modality data and therefore fall short of fully exploiting the ever-growing multimodal Earth observation data. To this end, we propose a novel multimodal deep learning framework that extends the conventional ViT with minimal modifications, abbreviated as ExViT, targeting the task of land use and land cover classification. Unlike common stems that adopt either linear patch projection or a deep regional embedder, our approach processes multimodal RS image patches with parallel branches of position-shared ViTs extended with separable convolution modules, which offers an economical way to leverage both spatial and modality-specific channel information. Furthermore, to promote information exchange across heterogeneous modalities, their tokenized embeddings are fused through a cross-modality attention module that exploits pixel-level spatial correlation in RS scenes. Both modifications significantly improve the discriminative ability of the classification tokens in each modality, and a further performance gain is attained by a full tokens-based decision-level fusion module. We conduct extensive experiments on two multimodal RS benchmark datasets, i.e., the Houston2013 dataset containing hyperspectral and light detection and ranging (LiDAR) data, and the Berlin dataset with hyperspectral and synthetic aperture radar (SAR) data, to demonstrate that ExViT outperforms concurrent competitors based on Transformer or convolutional neural network (CNN) backbones, as well as several competitive machine learning models. The source code and investigated datasets will be made publicly available at https://github.com/jingyao16/ExViT.
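
To make the described pipeline more concrete, below is a minimal PyTorch sketch of the architecture outlined in the abstract: two position-shared ViT branches with separable-convolution stems, a cross-modality attention step between the two token sets, and classification from the fused class tokens. All module names, dimensions, patch sizes, and the exact fusion details are illustrative assumptions for a 32x32 patch setting, not the authors' released implementation.

```python
# Minimal sketch of an ExViT-style forward pass (illustrative; not the official code).
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """Depthwise + pointwise convolution: an economical spatial/channel mixer."""
    def __init__(self, channels):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


class ExViTSketch(nn.Module):
    """Two parallel, position-shared ViT branches fused by cross-modality attention."""
    def __init__(self, in_ch_a, in_ch_b, n_classes=15, dim=64, patch=4, depth=2, heads=4, img=32):
        super().__init__()
        # Modality-specific patch embedding followed by a separable-convolution module.
        self.stem_a = nn.Sequential(nn.Conv2d(in_ch_a, dim, patch, stride=patch), SeparableConv2d(dim))
        self.stem_b = nn.Sequential(nn.Conv2d(in_ch_b, dim, patch, stride=patch), SeparableConv2d(dim))
        # One positional embedding shared by both branches (position-shared ViTs).
        self.pos = nn.Parameter(torch.zeros(1, 1 + (img // patch) ** 2, dim))
        self.cls_a = nn.Parameter(torch.zeros(1, 1, dim))
        self.cls_b = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder_a = nn.TransformerEncoder(layer, depth)  # clones the layer internally
        self.encoder_b = nn.TransformerEncoder(layer, depth)
        # Cross-modality attention: tokens of one modality attend to the other.
        self.cross_ab = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_ba = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Simplified decision-level fusion over the two classification tokens.
        self.head = nn.Linear(2 * dim, n_classes)

    def tokenize(self, x, stem, cls_tok):
        t = stem(x).flatten(2).transpose(1, 2)                      # B x N x dim
        t = torch.cat([cls_tok.expand(t.size(0), -1, -1), t], dim=1)
        return t + self.pos[:, : t.size(1)]

    def forward(self, xa, xb):
        ta = self.encoder_a(self.tokenize(xa, self.stem_a, self.cls_a))
        tb = self.encoder_b(self.tokenize(xb, self.stem_b, self.cls_b))
        fa, _ = self.cross_ab(ta, tb, tb)                           # modality A queries B
        fb, _ = self.cross_ba(tb, ta, ta)                           # modality B queries A
        return self.head(torch.cat([fa[:, 0], fb[:, 0]], dim=-1))


# Example: a 144-band hyperspectral patch paired with a 1-band LiDAR patch, both 32x32.
model = ExViTSketch(in_ch_a=144, in_ch_b=1)
logits = model(torch.randn(2, 144, 32, 32), torch.randn(2, 1, 32, 32))
print(logits.shape)  # torch.Size([2, 15])
```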
