Pioneering remote sensing image captioning (RSIC) methods rely on autoregressive decoding, which yields fluent and coherent sentences but suffers from high inference latency and computational cost. In contrast, non-autoregressive approaches improve inference speed by predicting multiple tokens simultaneously, at the cost of caption quality, since sequential dependencies between tokens are discarded. Recently, diffusion model-based non-autoregressive decoding has shown promise for natural image captioning through iterative refinement, but its effectiveness is limited by the intrinsic characteristics of remote sensing images (RSIs), which complicate the construction of robust conditional inputs and degrade description accuracy. To overcome these challenges, we propose a novel diffusion model for RSIC, named the Visual Conditional Control Diffusion Network (VCC-DiffNet). Specifically, we propose a Refined Multi-scale Feature Extraction (RMFE) module that extracts discernible visual context features from RSIs, which serve as the input to the diffusion model-based non-autoregressive decoder and conditionally control its multi-step denoising process. Furthermore, we propose an Interactive Enhanced Decoder (IE-Decoder) that exploits dual image–description interactions to generate descriptions finely aligned with the image content. Experiments on four representative RSIC datasets demonstrate that our non-autoregressive VCC-DiffNet performs comparably to, or even better than, popular autoregressive baselines on classic metrics, while achieving speedups of around 8.22× on Sydney-Captions, 11.61× on UCM-Captions, 15.20× on RSICD, and 8.13× on NWPU-Captions.
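The key idea behind such diffusion-based non-autoregressive decoding is that, rather than emitting tokens one at a time, the decoder starts from a noisy caption representation and repeatedly denoises all positions in parallel, with visual features injected at every step as the conditioning signal. The snippet below is a minimal, simplified sketch of this conditional multi-step refinement loop in PyTorch; all names (ToyConditionalDenoiser, generate, num_steps, etc.) are illustrative stand-ins, not the authors' RMFE or IE-Decoder implementations, and the noise schedule of a full diffusion model is omitted for brevity.

```python
# Hypothetical sketch of conditionally controlled multi-step denoising for captioning.
# Module and parameter names are illustrative; they do not reflect the paper's code.
import torch
import torch.nn as nn


class ToyConditionalDenoiser(nn.Module):
    """Predicts a cleaner token-embedding sequence from a noisy one,
    conditioned on visual context features via cross-attention."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, noisy_tokens, visual_ctx):
        # Image–description interaction: caption positions attend to image features.
        attended, _ = self.cross_attn(noisy_tokens, visual_ctx, visual_ctx)
        return self.ffn(noisy_tokens + attended)


@torch.no_grad()
def generate(denoiser, visual_ctx, seq_len=20, dim=256, num_steps=10):
    """Refines all caption positions in parallel over num_steps iterations,
    instead of decoding token by token autoregressively."""
    x = torch.randn(visual_ctx.size(0), seq_len, dim)  # start from pure noise
    for _ in range(num_steps):
        x = denoiser(x, visual_ctx)  # one conditional denoising step
    return x  # would be projected to vocabulary logits downstream


# Usage: visual_ctx stands in for multi-scale image features (e.g., from an RMFE-like module).
visual_ctx = torch.randn(2, 49, 256)  # (batch, regions, dim)
caption_embeddings = generate(ToyConditionalDenoiser(), visual_ctx)
print(caption_embeddings.shape)  # torch.Size([2, 20, 256])
```

Because every refinement step updates the whole sequence at once, the number of forward passes is fixed by num_steps rather than by caption length, which is the source of the speedups reported above.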