Abstract

Automatic image captioning, a highly challenging research problem, aims to understand and describe the contents of a complex scene in human-understandable natural language. The majority of recent solutions are based on holistic approaches in which the scene is described as a whole, potentially losing the important semantic relationships among objects in the scene. We propose Dense-CaptionNet, a region-based deep architecture for fine-grained description of image semantics, which localizes and describes each object/region in the image separately and generates a more detailed description of the scene. The proposed network contains three components that work together to produce a fine-grained description of image semantics. The first module generates region descriptions and object relationships, whereas the second generates the attributes of objects present in the scene. The textual descriptions produced by these two modules are concatenated and fed to the sentence generation module, which uses an encoder-decoder formulation to generate a grammatically correct, single-line, fine-grained description of the whole scene. The proposed Dense-CaptionNet is trained and tested on the Visual Genome, MSCOCO, and IAPR TC-12 datasets. The results establish a new state of the art compared with existing top-performing methods, e.g., Up-Down-Captioner; Show, Attend and Tell; Semstyle; and Neural Talk, especially on complex scenes. The implementation has been shared on GitHub for other researchers: http://bit.ly/2VIhfrf
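
The three-module pipeline described above can be summarized with a minimal, purely illustrative sketch. The class names and the placeholder outputs below are hypothetical stand-ins, not the authors' released implementation; the sketch only shows the data flow the abstract describes: two modules emit textual descriptions, their outputs are concatenated, and an encoder-decoder sentence generator produces a single fine-grained sentence.

```python
# Illustrative sketch of the Dense-CaptionNet data flow (hypothetical stubs,
# not the authors' code). Each class stands in for a trained network.

class RegionDescriptionModule:
    """Stand-in for module 1: localizes regions and emits phrases describing
    each object/region and the relationships between them."""
    def describe(self, image):
        return ["a brown dog on the grass", "a red ball near the dog"]

class ObjectAttributeModule:
    """Stand-in for module 2: emits attribute phrases for detected objects."""
    def attributes(self, image):
        return ["dog: furry, brown", "ball: small, red"]

class SentenceGenerationModule:
    """Stand-in for module 3: an encoder-decoder that turns the concatenated
    textual descriptions into one grammatically correct sentence."""
    def generate(self, text):
        return "A furry brown dog plays with a small red ball on the grass."

def dense_captionnet(image):
    regions = RegionDescriptionModule().describe(image)      # module 1 output
    attrs = ObjectAttributeModule().attributes(image)        # module 2 output
    combined = " ; ".join(regions + attrs)                    # concatenate text
    return SentenceGenerationModule().generate(combined)      # module 3 output

if __name__ == "__main__":
    print(dense_captionnet(image=None))  # placeholder input, illustration only
```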
