Abstract

There have been several attempts to integrate a spatial visual attention mechanism into image captioning models and to introduce semantic concepts as guidance for caption generation. High-level semantic information indicates the abstract, general content of an image and is beneficial to model performance. However, this high-level information is usually a static representation that does not account for the salient elements of the image. In this article, a semantic attention mechanism is therefore applied to the high-level information in place of the conventional static representation, so that salient high-level semantics are emphasized and redundant semantic information is suppressed during caption generation. Additionally, the generation of visual and non-visual words is separated: an adaptive attention mechanism switches the guidance of caption generation between the new fused information (image features combined with high-level semantics) and a language model. Visual words are thus generated from the image features and high-level semantic information, while non-visual words are predicted by the language model. The semantic attention, the adaptive attention, and the previously generated words are fused into a dedicated attention module at the input and output of a long short-term memory network, so that a caption can be generated as a concise sentence that accurately captures the rich content of the image. Experimental results show that the proposed model performs well on the evaluation metrics and produces logical, rich descriptions.
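To make the switching mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of the idea described above, not the authors' implementation: the class name, tensor shapes, the additive attention form, and the sigmoid sentinel gate beta are all illustrative assumptions. Semantic attention replaces the static high-level representation, and the gate decides whether the next word is driven by the fused visual/semantic context (visual words) or by a sentinel vector standing in for the language model (non-visual words).

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSemanticAttention(nn.Module):
    # Hypothetical sketch of the abstract's mechanism; names, shapes,
    # and the gating formulation are assumptions for illustration.
    def __init__(self, dim):
        super().__init__()
        self.wv = nn.Linear(dim, dim)   # projects spatial image features
        self.wc = nn.Linear(dim, dim)   # projects high-level semantic concepts
        self.wh = nn.Linear(dim, dim)   # projects the LSTM hidden state
        self.ws = nn.Linear(dim, dim)   # projects the visual sentinel
        self.score = nn.Linear(dim, 1)  # shared additive-attention scorer

    def attend(self, proj, items, h):
        # Additive attention of hidden state h over a set of item vectors.
        e = self.score(torch.tanh(proj(items) + self.wh(h).unsqueeze(1)))  # (B, N, 1)
        a = F.softmax(e, dim=1)                                            # weights
        return (a * items).sum(dim=1)                                      # (B, dim)

    def forward(self, regions, concepts, h, sentinel):
        # regions:  (B, R, dim) spatial CNN features
        # concepts: (B, C, dim) embeddings of detected semantic concepts
        # h:        (B, dim)    current LSTM hidden state
        # sentinel: (B, dim)    visual sentinel standing in for the language model
        ctx_v = self.attend(self.wv, regions, h)   # spatial visual attention
        ctx_c = self.attend(self.wc, concepts, h)  # semantic attention (non-static)
        fused = ctx_v + ctx_c                      # fused image + semantic context

        # Adaptive gate: beta near 1 leans on the language model (non-visual
        # words); beta near 0 leans on the fused context (visual words).
        beta = torch.sigmoid(self.score(torch.tanh(self.ws(sentinel) + self.wh(h))))
        return beta * sentinel + (1 - beta) * fused

A quick shape check of the sketch:

att = AdaptiveSemanticAttention(dim=512)
ctx = att(torch.randn(2, 49, 512),   # e.g. a 7x7 grid of region features
          torch.randn(2, 5, 512),    # e.g. five detected concepts
          torch.randn(2, 512),       # LSTM hidden state
          torch.randn(2, 512))       # visual sentinel
print(ctx.shape)                     # torch.Size([2, 512])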
