Abstract

Image captioning is one of the main visual tasks for scene understanding. It involves detecting and recognizing objects and their relationships, and describing the image content in natural language. The performance of a captioning model depends on how accurately its spatial attention mechanism selects image features. To improve that accuracy, we propose an image captioning model based on semantic feature update, which extracts features at different semantic levels of the scene image and iteratively updates them to produce an accurate description. First, we use Faster R-CNN to extract the object features, the visual relationship features between objects in the image, and the global features of the image. Second, a feature refining network takes the object features and relationship features as input and updates both. Finally, the updated features are fed to the spatial attention mechanism and integrated into the description framework to improve captioning accuracy. Experiments on the COCO dataset show that the proposed model outperforms other captioning models.
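The abstract does not specify the refining network or the attention function, so the following is only a minimal sketch of the general idea, under stated assumptions: object and relationship features are plain vectors, `refine` is a hypothetical averaging-based message-passing update (not the paper's network), and `attend` is a standard dot-product soft attention over the updated features. The names `refine`, `attend`, `pairs`, `steps`, and `alpha` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def refine(objects, relations, pairs, steps=2, alpha=0.5):
    """Hypothetical iterative update of object and relationship features.

    objects:   list of feature vectors, one per detected object
    relations: list of feature vectors, one per relationship
    pairs:     pairs[k] = (i, j), the two objects linked by relation k
    alpha:     assumed mixing weight between old feature and context
    """
    objs = [list(o) for o in objects]
    rels = [list(r) for r in relations]
    for _ in range(steps):
        # Update each relationship from the mean of its two endpoint objects.
        new_rels = []
        for r, (i, j) in zip(rels, pairs):
            ctx = [(a + b) / 2 for a, b in zip(objs[i], objs[j])]
            new_rels.append([(1 - alpha) * x + alpha * c for x, c in zip(r, ctx)])
        # Update each object from the mean of the relationships it takes part in
        # (using the relationship features from before this step).
        new_objs = []
        for idx, o in enumerate(objs):
            incident = [r for r, (i, j) in zip(rels, pairs) if idx in (i, j)]
            if incident:
                ctx = [sum(col) / len(incident) for col in zip(*incident)]
                new_objs.append([(1 - alpha) * x + alpha * c for x, c in zip(o, ctx)])
            else:
                new_objs.append(o)  # objects without relationships stay unchanged
        objs, rels = new_objs, new_rels
    return objs, rels


def attend(features, query):
    """Dot-product soft attention: returns (weights, context vector)."""
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context
```

In a captioning decoder, `query` would typically be the current hidden state, and the returned context vector would condition the next word prediction. In this sketch it is simply any vector with the same dimension as the features.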
