Abstract

Remote sensing image captioning (RSIC) is a cross-modal task in artificial intelligence that automatically describes the Earth’s geological properties as captured from an aerial view. Encoder–decoder methods based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are widely adopted for RSIC, but they have two main constraints: first, single-level static convolutional features are insufficient to capture inherent geographical characteristics; second, recurrent time-step sequences are difficult to train. To address these challenges, a novel fully attentive framework, the Spatial-Channel Attention based MEmory-guided Transformer (SCAMET), is proposed, which calibrates multilevel attentive visual features and aligns them with linguistic information through persistent memory. Here, a CNN is integrated with a Transformer to generate captions for remote sensing images. To capture deeper semantic knowledge of the multi-scale, multi-shape, multi-object content of remote sensing images, multi-attentive visual features are extracted by applying spatial and channel attention separately. To decode the multi-attentive features into a caption, this work proposes a memory-guided Transformer as the linguistic decoder. Specifically, learnable memory elements are incorporated into the multi-head attention block, which perceives intrinsic associations within the multi-attentive visual features and reconciles them with linguistic information. Ablation studies are conducted on three public RSIC datasets, Sydney-captions, UCM-captions and RSICD, to evaluate the performance of the proposed method. Quantitative and qualitative analyses show that the proposed method performs satisfactorily compared with state-of-the-art approaches. This work also proposes a “Weighted Mean Score” index to evaluate the overall performance of a model across all datasets by accounting for the global contribution of each test set. The implementation of the proposed work is available at: https://github.com/GauravGajbhiye/SCAMET_RSIC.
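
As an illustration of what "learnable memory elements in an attention block" can look like, the following is a minimal single-head sketch in plain Python. It assumes the memory slots are simply appended to the keys and values before scaled dot-product attention, which is one common way to add persistent memory; the abstract does not specify the exact mechanism, so this should not be read as the SCAMET implementation. The names `memory_attention`, `mem_keys` and `mem_values` are hypothetical, and the random vectors stand in for trained parameters and real features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_attention(queries, keys, values, mem_keys, mem_values):
    """Single-head scaled dot-product attention with extra memory slots.

    Assumption: memory is concatenated to the keys and values, so every
    query can attend to both the input features and the memory slots.
    """
    dim = len(queries[0])
    all_keys = keys + mem_keys        # input keys followed by memory keys
    all_values = values + mem_values  # input values followed by memory values
    outputs, weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in all_keys
        ]
        w = softmax(scores)
        # Weighted sum of value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, all_values)) for c in range(dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


def random_vectors(rng, count, dim):
    """Placeholder vectors; in a real model these would be learned or computed."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 4
visual = random_vectors(rng, 3, dim)    # stand-in for attentive visual features
words = random_vectors(rng, 2, dim)     # stand-in for word embeddings (queries)
mem_keys = random_vectors(rng, 2, dim)  # hypothetical learnable memory keys
mem_values = random_vectors(rng, 2, dim)  # hypothetical learnable memory values

outputs, weights = memory_attention(words, visual, visual, mem_keys, mem_values)
```

Each query here produces one attention weight per visual feature plus one per memory slot, so the memory can carry information that is not present in the current image's features.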
