Remote sensing image captioning involves generating a concise textual description for an input aerial image. The task has received significant attention, and several recent proposals are based on neural encoder-decoder models. Most previous methods are trained to generate discrete outputs, i.e. word tokens that match the reference sentences word-by-word, thereby optimizing generation locally at the token level instead of globally at the sentence level. This paper explores an alternative generation method based on continuous outputs, which produces sequences of embedding vectors instead of directly predicting word tokens. We argue that continuous output models have the potential to better capture the global semantic similarity between captions and images, e.g., by facilitating the use of loss functions that match different views of the data. This includes comparing representations for individual tokens and for entire captions, as well as comparing captions against intermediate image representations. We experimentally compare discrete versus continuous output captioning methods on the UCM and RSICD datasets, which are extensively used in the area despite some issues that we also discuss. Results show that the encoder-decoder framework with continuous outputs can indeed outperform the standard approach based on discrete outputs on both datasets. The proposed approach is also competitive with the state-of-the-art model in the area.
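To make the contrast with discrete-output decoding concrete, the sketch below illustrates one common way of realizing continuous outputs: the decoder predicts an embedding vector at each step, training minimizes a cosine distance to the pre-trained embedding of the gold token, and inference maps each predicted vector to its nearest neighbour in the embedding table. This is a minimal illustration under assumed names (`ContinuousOutputHead`, `embed_table`, `hidden_dim`, and so on), not the exact losses or architecture used in the paper, which also considers sentence-level and image-caption comparisons.

```python
# Illustrative sketch of continuous-output decoding with a token-level cosine loss.
# All names and shapes are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContinuousOutputHead(nn.Module):
    """Maps decoder hidden states to embedding vectors instead of vocabulary logits."""
    def __init__(self, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> (batch, seq_len, embed_dim)
        return self.proj(hidden_states)

def token_level_cosine_loss(pred_embs, target_ids, embed_table):
    # Compare each predicted vector with the pre-trained embedding of the gold token.
    target_embs = embed_table[target_ids]          # (batch, seq_len, embed_dim)
    return (1.0 - F.cosine_similarity(pred_embs, target_embs, dim=-1)).mean()

def decode_nearest_tokens(pred_embs, embed_table):
    # At inference time, map each predicted vector to the nearest word embedding.
    sims = F.normalize(pred_embs, dim=-1) @ F.normalize(embed_table, dim=-1).T
    return sims.argmax(dim=-1)                     # (batch, seq_len) token ids
```

Because the model output lives in the same space as the word (or sentence) embeddings, additional losses that compare whole-caption representations, or caption representations against intermediate image features, can be added on top of this token-level objective.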