Abstract

Remote-sensing image (RSI) captioning aims to automatically generate sentences that describe the content of RSIs. The multiscale information in RSIs captures the attributes and complex relationships of objects of different sizes, but current methods still struggle to use this information efficiently to generate accurate and detailed sentences. In this letter, we propose a new model based on the “encoder–decoder” framework to address this problem. In the encoder, we fuse features from different layers of ResNet-50 to extract multiscale information. In the decoder, we propose a multilayer aggregated transformer (MLAT) that fully exploits the extracted information when generating sentences. Specifically, as the transformer encoding layers go deeper, the extracted features become increasingly similar. To make full use of the features from different transformer encoding layers, compress redundant information, and retain important information, a long short-term memory (LSTM) network in MLAT aggregates the features into better representations. The self-attention mechanism and the aggregation strategy enable MLAT to exploit the features fully. Experimental results show that MLAT as the decoder helps the model address the multiscale problem, significantly improves sentence accuracy and diversity, and outperforms other current methods. Our code is available at https://github.com/Chen-Yang-Liu/MLAT.
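For illustration, the following is a minimal PyTorch-style sketch of the layer-aggregation idea described in the abstract: an LSTM runs over the outputs of stacked transformer encoding layers to compress redundant information into a single representation. All names (LayerAggregator, d_model, etc.) are hypothetical; this is not the released implementation, which is available at the repository linked above.

```python
# Hypothetical sketch (not the authors' released code): aggregating the outputs
# of stacked Transformer encoder layers with an LSTM, as the abstract describes.
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    def __init__(self, d_model=512, num_layers=3, nhead=8):
        super().__init__()
        self.encoder_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # The LSTM runs over the sequence of per-layer features for each token,
        # compressing redundant information across layers into one vector.
        self.aggregator = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):
        # x: (batch, num_tokens, d_model) multiscale visual features
        layer_outputs = []
        for layer in self.encoder_layers:
            x = layer(x)
            layer_outputs.append(x)
        # Stack per-layer features along a new "layer" axis: (B, T, L, D)
        stacked = torch.stack(layer_outputs, dim=2)
        b, t, l, d = stacked.shape
        # Run the LSTM over the layer axis for every token independently
        _, (h_n, _) = self.aggregator(stacked.reshape(b * t, l, d))
        # Final hidden state is the aggregated feature for each token
        return h_n[-1].reshape(b, t, d)
```

As a usage sketch, `LayerAggregator()(torch.randn(2, 49, 512))` would return aggregated features of shape (2, 49, 512), which a caption decoder could then attend over.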
