Abstract
Standard image caption generation systems produce generic descriptions of images and do not utilize any contextual information or world knowledge. In particular, they are unable to generate captions that contain references to the geographic context of an image, for example, the location where a photograph is taken or relevant geographic objects around an image location. In this paper, we develop a geo-aware image caption generation system, which incorporates geographic contextual information into a standard image captioning pipeline. We propose a way to build an image-specific representation of the geographic context and adapt the caption generation network to produce appropriate geographic names in the image descriptions. We evaluate our system on a novel captioning dataset that contains contextualized captions and geographic metadata and achieve substantial improvements in BLEU, ROUGE, METEOR and CIDEr scores. We also introduce a new metric to assess generated geographic references directly and empirically demonstrate our system’s ability to produce captions with relevant and factually accurate geographic referencing.
Highlights
Image caption generation is a popular task that aims at producing a natural language description of a given image
A standard neural image captioning system consists of two stages: an “encoder”, a Convolutional Neural Network that encodes the visual features of an image as a vector, and a “decoder”, a language model that is initialized with this vector and generates a caption word by word (a minimal sketch of this pipeline follows the list below)
In this paper we present geo-aware image captioning, where geographic contextual information is incorporated into the generated captions
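The encoder-decoder pipeline described in the second highlight can be sketched as follows. This is a minimal, illustrative sketch assuming PyTorch and torchvision; the module names, dimensions, and hyperparameters are ours, not the paper's.

```python
# Minimal sketch of a standard encoder-decoder captioner (PyTorch assumed).
# All names and sizes here are illustrative, not taken from the paper.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image vector -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # image vector -> initial cell state
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, tokens):
        # The language model is initialized with the encoded image vector,
        # then predicts the caption one word at a time.
        h0 = self.init_h(image_feats).unsqueeze(0)
        c0 = self.init_c(image_feats).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)  # per-step vocabulary logits

# "Encoder": a CNN with its classification head removed yields one image vector.
cnn = resnet50(weights=None)
cnn.fc = nn.Identity()
image_feats = cnn(torch.randn(2, 3, 224, 224))            # (2, 2048)

decoder = CaptionDecoder(vocab_size=10_000)
logits = decoder(image_feats, torch.randint(0, 10_000, (2, 12)))
print(logits.shape)                                       # torch.Size([2, 12, 10000])
```

At inference time the same decoder is run autoregressively, feeding each predicted word back in as the next input until an end-of-sequence token is produced.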
Summary
Image caption generation is a popular task that aims at producing a natural language description of a given image. A standard neural image captioning system consists of two stages: an “encoder”, a Convolutional Neural Network that encodes the visual features of an image as a vector, and a “decoder”, a language model that is initialized with this vector and generates a caption word by word. People tend to describe images by interpreting them in light of contextual factors and world knowledge, whereas standard encoder-decoder captioning systems take no contextual or world knowledge into account. One aspect missing from standard caption generation systems is the ability to produce image descriptions influenced by the geographic context, i.e. the geographic objects surrounding the image location. Consider the photograph in Figure 1, for which the automatically generated caption reads: “a park bench sitting in the middle of a park”.
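The paper builds an image-specific representation of the geographic context and adapts the decoder to produce geographic names. As a rough, hedged illustration of the general idea only (the encoding scheme, the fusion by concatenation, and every name below are our assumptions, not the paper's actual method), one could encode nearby geographic objects into a context vector and fuse it with the image vector before initializing the decoder:

```python
# Hypothetical sketch: fusing a geographic-context vector with the image
# vector. The paper's actual representation and fusion may differ.
import torch
import torch.nn as nn

class GeoContextEncoder(nn.Module):
    """Pools nearby geographic objects (illustrated here as type/distance
    pairs) into a single context vector. All dimensions are illustrative."""
    def __init__(self, num_object_types, embed_dim=128, out_dim=256):
        super().__init__()
        self.type_embed = nn.Embedding(num_object_types, embed_dim)
        self.proj = nn.Linear(embed_dim + 1, out_dim)  # +1 for the distance feature

    def forward(self, object_types, distances):
        # object_types: (batch, n_objects), distances: (batch, n_objects)
        e = self.type_embed(object_types)
        x = torch.cat([e, distances.unsqueeze(-1)], dim=-1)
        x = torch.relu(self.proj(x))
        return x.mean(dim=1)  # average-pool over the nearby objects

# Fusion: concatenate image and geo vectors, then initialize the caption
# decoder from the joint representation instead of the image vector alone.
image_feats = torch.randn(2, 2048)
geo = GeoContextEncoder(num_object_types=500)
geo_feats = geo(torch.randint(0, 500, (2, 8)), torch.rand(2, 8))
joint = torch.cat([image_feats, geo_feats], dim=-1)       # (2, 2048 + 256)
```

Producing the correct geographic names in the output additionally requires the decoder's vocabulary (or a copy/pointer mechanism) to cover the names present in the context, which is part of what the paper's adapted generation network addresses.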