Abstract

To reduce the negative impact of vanishing gradients on the guiding long short-term memory (gLSTM) model in image captioning, we propose a visually enhanced gLSTM model for image caption generation. Visual features of an image's region of interest (RoI) are extracted and used as guiding information in the gLSTM, so that RoI visual information steers the model toward more accurate captions. Two visual enhancement methods are proposed, one based on the salient region and one on the entire image: CNN features of the most important semantic region and visual-word features of the full image are extracted to guide the LSTM in generating the most important semantic words. The visual and text features of similar images are then projected into a common semantic space via canonical correlation analysis (CCA) to obtain the visual enhancement guiding information, which is added to each memory cell of the gLSTM when generating caption words. Compared with the original gLSTM, the visually enhanced gLSTM focuses on the important semantic region, which is more consistent with human perception of images. Experiments on the Flickr8k dataset show that the proposed method produces more accurate captions and outperforms the baseline gLSTM and other popular image captioning methods.
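As a rough illustration of the pipeline the abstract describes, the sketch below is not the authors' implementation: all dimensions, feature extractors, and weight names are illustrative assumptions, and random vectors stand in for real CNN and text features. It uses scikit-learn's CCA to project the two views of similar images into a common semantic space, takes the projected query feature as the guidance vector, and feeds that vector into every gate of one gLSTM step, following the standard gLSTM formulation (Jia et al.) that this work builds on.

```python
# A minimal sketch, assuming toy dimensions and random stand-in features.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)

# Assumed stand-ins: CNN features of retrieved similar images (visual view)
# and text features of their captions (text view).
n_similar, d_vis, d_txt, d_sem = 200, 512, 300, 64
V = rng.normal(size=(n_similar, d_vis))   # visual features of similar images
T = rng.normal(size=(n_similar, d_txt))   # text features of their captions

# Project both views into a common semantic space with CCA; the projected
# RoI feature of the query image serves as the guidance vector g.
cca = CCA(n_components=d_sem).fit(V, T)
v_query = rng.normal(size=(1, d_vis))     # RoI feature of the image to caption
g = cca.transform(v_query).ravel()        # guidance vector, shape (d_sem,)

def glstm_step(x, h_prev, c_prev, W, g):
    """One gLSTM step: the guidance g enters every gate, so each memory
    cell update sees the visual enhancement information."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    def gate(name, act):
        z = W[name + 'x'] @ x + W[name + 'h'] @ h_prev + W[name + 'g'] @ g
        return act(z)
    i = gate('i', sigmoid)            # input gate
    f = gate('f', sigmoid)            # forget gate
    o = gate('o', sigmoid)            # output gate
    c_tilde = gate('c', np.tanh)      # candidate cell state
    c = f * c_prev + i * c_tilde      # memory cell guided at every step
    h = o * np.tanh(c)
    return h, c

# Toy weights and one step on a random word embedding.
d_in, d_hid = 256, 512
W = {k + s: rng.normal(scale=0.01, size=(d_hid, dim))
     for k in 'ifoc'
     for s, dim in (('x', d_in), ('h', d_hid), ('g', d_sem))}
h, c = glstm_step(rng.normal(size=d_in), np.zeros(d_hid), np.zeros(d_hid), W, g)
```

In the actual model, g would come from the RoI CNN feature (or the full-image visual-word feature) projected through a CCA learned on retrieved similar images and their captions; the random inputs here only make the sketch runnable end to end.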
