Abstract

Most remote sensing image captioning (IC) models are based on encoder–decoder frameworks in which a convolutional neural network (CNN) encodes the image information and a recurrent neural network (RNN) decodes it into a sentence description. To achieve good accuracy, encoder–decoder frameworks relying on RNNs typically require a large number of annotated samples. Furthermore, they demand expensive computational resources to reach reasonable training and testing times. In this article, we aim to address these issues by introducing a novel decoder based on support vector machines (SVMs). In particular, instead of RNNs, we propose a novel network of SVMs to decode the image information into a sentence description. The proposed IC system is particularly interesting when only a limited number of training samples is available. Experiments conducted on four different IC datasets confirm the promising capability of the proposed IC system to generate descriptions that are highly correlated with the image content. The proposed IC system is also characterized by short training and inference times compared with other state-of-the-art models.
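
To illustrate the general idea of replacing the RNN decoder with SVMs, the sketch below shows one plausible reading of such a design: CNN image features are assumed to be precomputed, and a separate SVM predicts the word at each position of the caption from the image features plus the previously generated word. The vocabulary, step-wise layout, and kernel choice here are illustrative assumptions, not the exact architecture proposed in the paper.

```python
# Minimal sketch (not the authors' exact architecture): a chain of SVM
# classifiers decodes precomputed CNN image features into a caption,
# predicting one word per step from the image features concatenated
# with a one-hot encoding of the previous word.
import numpy as np
from sklearn.svm import SVC

# Toy vocabulary for illustration only.
VOCAB = ["<start>", "a", "plane", "parked", "on", "the", "runway", "<end>"]
W2I = {w: i for i, w in enumerate(VOCAB)}


def one_hot(idx, size=len(VOCAB)):
    v = np.zeros(size)
    v[idx] = 1.0
    return v


class SVMDecoder:
    """One SVM per decoding step, trained on (image feature, previous word) pairs."""

    def __init__(self, max_len=6):
        self.steps = [SVC(kernel="rbf") for _ in range(max_len)]

    def fit(self, features, captions):
        """features: list of CNN feature vectors; captions: lists of word
        indices already padded with <end> up to max_len."""
        for t, svm in enumerate(self.steps):
            X = [np.concatenate([f, one_hot(cap[t - 1] if t else W2I["<start>"])])
                 for f, cap in zip(features, captions)]
            y = [cap[t] for cap in captions]
            svm.fit(np.array(X), np.array(y))

    def decode(self, feature):
        """Greedily generate a caption for one image feature vector."""
        words, prev = [], W2I["<start>"]
        for svm in self.steps:
            x = np.concatenate([feature, one_hot(prev)]).reshape(1, -1)
            prev = int(svm.predict(x)[0])
            if VOCAB[prev] == "<end>":
                break
            words.append(VOCAB[prev])
        return " ".join(words)
```

Because each step is a standard kernel SVM rather than a recurrent cell trained by backpropagation through time, training on a small annotated set is comparatively fast, which is consistent with the low-data, low-compute motivation stated in the abstract.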
