VCC-DiffNet: Visual Conditional Control Diffusion Network for Remote Sensing Image Captioning

Qimin Cheng, Yuqi Xu, Ziyang Huang

doi:10.3390/rs16162961

Abstract

Pioneering remote sensing image captioning (RSIC) works use autoregressive decoding to produce fluent and coherent sentences but suffer from high latency and high computation costs. In contrast, non-autoregressive approaches improve inference speed by predicting multiple tokens simultaneously, though at the cost of performance due to a lack of sequential dependencies. Recently, diffusion model-based non-autoregressive decoding has shown promise in natural image captioning through iterative refinement, but its effectiveness is limited by the intrinsic characteristics of remote sensing images (RSIs), which complicate robust input construction and reduce description accuracy. To overcome these challenges, we propose a diffusion model for RSIC, named the Visual Conditional Control Diffusion Network (VCC-DiffNet). Specifically, we propose a Refined Multi-scale Feature Extraction (RMFE) module that extracts discernible visual context features from RSIs; these features are fed to the diffusion model-based non-autoregressive decoder to conditionally control a multi-step denoising process. Furthermore, we propose an Interactive Enhanced Decoder (IE-Decoder) that uses dual image–description interactions to generate descriptions finely aligned with the image content. Experiments conducted on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet performs comparably to, or even better than, popular autoregressive baselines on classic metrics, achieving speedups of around 8.22× on Sydney-Captions, 11.61× on UCM-Captions, 15.20× on RSICD, and 8.13× on NWPU-Captions.
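
The contrast between the two decoding styles can be illustrated with a toy sketch. This is not the VCC-DiffNet implementation: `toy_denoiser`, `diffusion_decode`, `autoregressive_decode`, the step count, and the list of floats standing in for visual context features are all hypothetical names and values chosen for illustration. The sketch only shows that an autoregressive decoder needs one sequential call per token, while a conditionally controlled denoising loop refines all positions in parallel over a fixed number of steps.

```python
import random


def autoregressive_decode(next_token_fn, max_len):
    """Emit tokens one at a time; each call sees the prefix generated so far."""
    tokens, calls = [], 0
    for _ in range(max_len):
        tokens.append(next_token_fn(tokens))
        calls += 1
    return tokens, calls


def toy_denoiser(noisy, step, visual_context):
    """Hypothetical stand-in for a learned denoiser (an assumption).

    Pulls every position toward a target derived from the visual context,
    updating all positions in the same call.
    """
    return [x + (c - x) / step for x, c in zip(noisy, visual_context)]


def diffusion_decode(denoise_fn, visual_context, num_steps, seed=0):
    """Start from Gaussian noise and refine all positions over num_steps calls."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in visual_context]
    calls = 0
    for step in range(num_steps, 0, -1):
        # The visual context conditions every denoising step.
        latent = denoise_fn(latent, step, visual_context)
        calls += 1
    return latent, calls


if __name__ == "__main__":
    _, ar_calls = autoregressive_decode(lambda prefix: len(prefix), max_len=20)
    _, nar_calls = diffusion_decode(toy_denoiser, [0.5] * 20, num_steps=5)
    print(f"sequential decoder calls: autoregressive={ar_calls}, diffusion={nar_calls}")
```

In this toy setting the number of sequential calls depends on the caption length for the autoregressive loop and on the number of denoising steps for the diffusion loop. The speedups reported in the abstract are measured results and are not derived from this sketch.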

Journal: Remote Sensing
Publication Date: Aug 12, 2024
License type: CC BY 4.0

Similar Papers

Recurrent Attention and Semantic Gate for Remote Sensing Image Captioning
Yunpeng Li ... Xin Wang
IEEE Transactions on Geoscience and Remote Sensing | VOL. 60
01 Jan 2021

Remote sensing image caption generation via transformer and reinforcement learning
Xiangqing Shen ... Jiaqi Zhao
Multimedia Tools and Applications | VOL. 79
17 Jul 2020

Remote sensing image captioning via Variational Autoencoder and Reinforcement Learning
Xiangqing Shen ... Mingming Liu
Knowledge-Based Systems | VOL. 203
23 Apr 2020

Meta captioning: A meta learning based remote sensing image captioning framework
Qiaoqiao Yang ... Peng Ren
ISPRS Journal of Photogrammetry and Remote Sensing | VOL. 186
25 Feb 2022
