Abstract

As a multi-modal task, remote sensing image captioning (RSIC) plays an essential role in image understanding. However, most current RSIC methods do not make sufficient use of image features. We propose an attention-guided visual semantic fusion (AVSF) method for RSIC, built on the encoder-decoder framework. The attention mechanism captures the relationship between the global image and local regions, as well as the relationship between image regions and the corresponding text, which benefits caption generation. First, the convolutional neural network VGG16 is used as the backbone of the encoder to extract features from remote sensing images. Second, image features are fused with text features, and an attention mechanism is introduced to strengthen the focus on visual features. Finally, a long short-term memory (LSTM) network serves as the decoder, learning the mapping between image visual features and text semantic features. Experiments show that the evaluation metrics improve on the UCM and Sydney datasets.

Keywords: Image captioning, Remote sensing, Feature fusion, Attention mechanism
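The abstract outlines a three-stage pipeline: a VGG16 encoder, attention-guided fusion of visual and text features, and an LSTM decoder. The following is a minimal PyTorch sketch of that kind of architecture; the layer sizes, the additive attention form, the concatenation-based fusion, and all module names are illustrative assumptions, since the abstract does not specify them.

```python
# Sketch of an encoder-decoder captioner with attention-guided
# visual-semantic fusion. All hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class Encoder(nn.Module):
    """VGG16 backbone; the final conv feature map serves as region features."""

    def __init__(self):
        super().__init__()
        # weights=None for a self-contained example; a paper-style setup
        # would typically load ImageNet-pretrained weights.
        self.backbone = vgg16(weights=None).features

    def forward(self, images):               # images: (B, 3, 224, 224)
        f = self.backbone(images)            # (B, 512, 7, 7)
        return f.flatten(2).transpose(1, 2)  # (B, 49 regions, 512)


class Attention(nn.Module):
    """Additive attention: scores each image region against the decoder state."""

    def __init__(self, feat_dim, hid_dim, attn_dim=256):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, attn_dim)
        self.w_h = nn.Linear(hid_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, R, feat_dim), h: (B, hid_dim)
        scores = self.v(torch.tanh(self.w_v(feats) + self.w_h(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)  # attention weights over regions
        return (alpha * feats).sum(dim=1)     # attended visual context (B, feat_dim)


class Decoder(nn.Module):
    """LSTM decoder that fuses the attended visual context with each word embedding."""

    def __init__(self, vocab_size, emb_dim=256, feat_dim=512, hid_dim=512):
        super().__init__()
        self.hid_dim = hid_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attend = Attention(feat_dim, hid_dim)
        # Fusion by concatenating the word embedding with the visual context.
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, tokens):      # tokens: (B, T) ground-truth word ids
        B, T = tokens.shape
        h = feats.new_zeros(B, self.hid_dim)
        c = feats.new_zeros(B, self.hid_dim)
        logits = []
        for t in range(T):
            ctx = self.attend(feats, h)    # re-attend to the image at every step
            x = torch.cat([self.embed(tokens[:, t]), ctx], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, T, vocab_size) word scores
```

Recomputing the attention context at every decoding step lets the decoder shift its focus across image regions as the sentence unfolds, which is the region-to-text relationship the abstract emphasizes.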
