Abstract

Remote sensing image captioning involves generating a concise textual description for an input aerial image. The task has received significant attention, and several recent proposals are based on neural encoder-decoder models. Most previous methods are trained to generate discrete outputs, i.e. word tokens that match the reference sentences word-by-word, thereby optimizing the generation locally at the token level instead of globally at the sentence level. This paper explores an alternative generation method based on continuous outputs, which produces sequences of embedding vectors instead of directly predicting word tokens. We argue that continuous output models have the potential to better capture the global semantic similarity between captions and images, e.g., by facilitating the use of loss functions that match different views of the data: comparing representations for individual tokens and for entire captions, and also comparing captions against intermediate image representations. We experimentally compare discrete versus continuous output captioning methods on the UCM and RSICD datasets, which are extensively used in the area despite some issues that we also discuss. Results show that the alternative encoder-decoder framework with continuous outputs can indeed lead to better results on the two datasets, compared to the standard approach based on discrete outputs. The proposed approach is also competitive against the state-of-the-art model in the area.
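To make the continuous-output idea concrete, the following is a minimal sketch of one common way such decoding can work: the decoder emits an embedding vector at each step, and the generated token is the vocabulary word whose pre-trained embedding is closest under cosine similarity. This is an illustration only, not the authors' implementation; `decoder`, `word_embeddings`, and the special token ids are hypothetical names.

```python
import torch
import torch.nn.functional as F

def decode_continuous(decoder, image_features, word_embeddings,
                      max_len=20, bos_id=1, eos_id=2):
    """word_embeddings: (vocab_size, dim) matrix of pre-trained target embeddings."""
    emb_norm = F.normalize(word_embeddings, dim=-1)
    tokens = [bos_id]
    for _ in range(max_len):
        # The decoder outputs a continuous vector instead of a softmax over the vocabulary.
        pred_vec = decoder(image_features, torch.tensor(tokens))   # (dim,)
        # Map the predicted vector to the nearest word embedding (cosine similarity).
        sims = emb_norm @ F.normalize(pred_vec, dim=-1)            # (vocab_size,)
        next_id = int(sims.argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```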

Highlights

  • The idea of interacting with remote sensing imagery through natural language has been gaining increased interest [1], [2], [3]

  • We propose a novel encoder-decoder framework for remote sensing image captioning that predicts continuous representations in the decoder, together with a strong image encoder pre-trained on in-domain data through natural language supervision

  • We argue that the use of continuous outputs can facilitate optimizing for semantic similarity towards the reference captions in the training data, at different granularities (see the loss sketch after this list)
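The sketch below illustrates how a continuous-output model can combine loss terms defined over different views of the data: individual token embeddings, pooled caption representations, and the caption-versus-image comparison. The weights and the mean-pooling choice are assumptions for illustration, not the authors' exact formulation.

```python
import torch.nn.functional as F

def multi_view_loss(pred_vecs, ref_vecs, img_vec, w_tok=1.0, w_sent=1.0, w_img=1.0):
    """pred_vecs, ref_vecs: (seq_len, dim) embeddings; img_vec: (dim,) image representation."""
    # Token level: align each predicted embedding with the reference token embedding.
    tok_loss = (1 - F.cosine_similarity(pred_vecs, ref_vecs, dim=-1)).mean()
    # Sentence level: compare mean-pooled representations of the whole captions.
    sent_loss = 1 - F.cosine_similarity(pred_vecs.mean(0), ref_vecs.mean(0), dim=0)
    # Image-vs-text: pull the generated caption representation towards the image embedding.
    img_loss = 1 - F.cosine_similarity(pred_vecs.mean(0), img_vec, dim=0)
    return w_tok * tok_loss + w_sent * sent_loss + w_img * img_loss
```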

Summary

INTRODUCTION

The idea of interacting with remote sensing imagery through natural language has been gaining increased interest [1], [2], [3]. We argue that there are other possible advantages to language generation methods leveraging continuous outputs. These approaches can facilitate the use of novel loss functions to optimize semantic similarity at different granularities: at the token level, at the sentence level, and in terms of image-vs-text similarity. We also advance a novel decoding strategy that explores continuous outputs, going beyond standard beam search by evaluating the generated sentences according to their similarity towards the input image. Both these ideas extend the work of Kumar and Tsvetkov [4] on text generation with continuous outputs, which only used greedy decoding with a token-level loss.
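The decoding strategy described above can be pictured as re-ranking a set of candidate captions by their similarity to the input image, as in the sketch below. This is a simplified illustration under the assumption that a text encoder maps captions into the same embedding space as the image; `text_encoder` and `candidates` are hypothetical names rather than the paper's actual components.

```python
import torch.nn.functional as F

def rerank_by_image_similarity(candidates, image_embedding, text_encoder):
    """candidates: list of token sequences produced by the decoder (e.g. via beam search).
    text_encoder: any model mapping a caption into the same space as image_embedding."""
    scored = []
    for cand in candidates:
        cap_vec = text_encoder(cand)                            # (dim,)
        sim = F.cosine_similarity(cap_vec, image_embedding, dim=0)
        scored.append((float(sim), cand))
    # Keep the candidate whose representation best matches the input image.
    return max(scored, key=lambda pair: pair[0])[1]
```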
