Abstract

Automatically describing image content in natural language is a fundamental challenge for the computer vision community. Most existing methods use visual information alone to generate sentences directly. However, relying only on visual information is insufficient for generating fine-grained descriptions of given images. In this paper, we exploit the fusion of visual information and high-level semantic information for image captioning. We propose a hierarchical deep neural network consisting of a bottom layer and a top layer: the former extracts visual and high-level semantic information from the image and detected regions, respectively, while the latter integrates both with an adaptive attention mechanism to generate the caption. Experimental results show that our method achieves competitive performance against state-of-the-art methods on the MSCOCO dataset.
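To make the two-layer design concrete, here is a minimal PyTorch sketch of the architecture the abstract describes. All module names, dimensions, and the specific gating form are illustrative assumptions, not the paper's actual implementation: the bottom layer projects global visual features and region-level semantic features into a shared space, and the top layer fuses them with an adaptive gate inside an LSTM decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomLayer(nn.Module):
    """Bottom layer (sketch): projects global visual features and
    per-region semantic features into a shared hidden space.
    The upstream CNN backbone and region detector are assumed."""
    def __init__(self, feat_dim=2048, sem_dim=512, hidden_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(feat_dim, hidden_dim)   # global image features
        self.semantic_proj = nn.Linear(sem_dim, hidden_dim)  # detected-region features

    def forward(self, visual_feats, region_feats):
        # visual_feats: (B, feat_dim); region_feats: (B, R, sem_dim)
        v = torch.tanh(self.visual_proj(visual_feats))       # (B, H)
        s = torch.tanh(self.semantic_proj(region_feats))     # (B, R, H)
        return v, s

class AdaptiveAttentionDecoder(nn.Module):
    """Top layer (sketch): attends over region-level semantic features,
    mixes the result with the visual feature via an adaptive gate,
    and decodes one caption token per step with an LSTM cell."""
    def __init__(self, vocab_size, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim * 2, hidden_dim)
        self.att = nn.Linear(hidden_dim, 1)    # scores each region
        self.gate = nn.Linear(hidden_dim, 1)   # visual vs. semantic trade-off
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, v, s, state):
        h, c = state
        # Attention over region features, conditioned on the decoder state
        scores = self.att(torch.tanh(s + h.unsqueeze(1)))    # (B, R, 1)
        alpha = F.softmax(scores, dim=1)
        s_ctx = (alpha * s).sum(dim=1)                       # (B, H)
        # Adaptive gate: how much to rely on visual vs. semantic context
        beta = torch.sigmoid(self.gate(h))                   # (B, 1)
        ctx = beta * v + (1 - beta) * s_ctx                  # fused context
        h, c = self.lstm(torch.cat([self.embed(word), ctx], dim=1), (h, c))
        return self.out(h), (h, c)                           # vocab logits, new state
```

At each decoding step the gate `beta` rebalances global visual evidence against attended region semantics, which is one plausible reading of the "adaptive attention mechanism" mentioned in the abstract.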
