Abstract

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with the textual and visual features of neighbouring and contextual nodes. During generation, the model further incorporates visual relationships through multi-task learning that jointly predicts word and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines and achieves state-of-the-art performance under a wide range of evaluation metrics. The code of our paper has been made publicly available.
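As a rough illustration of the multi-task generation step described above (a minimal PyTorch-style sketch with assumed module names, shapes, and loss weighting, not the authors' released code), the decoder's hidden states can feed two classification heads whose cross-entropy losses are combined:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the same decoder hidden states drive a word head and an
# object/predicate tag head; their cross-entropy losses are summed with a
# trade-off weight. All names and shapes are illustrative assumptions.
class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_dim, vocab_size, num_tags, tag_loss_weight=0.5):
        super().__init__()
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next-word logits
        self.tag_head = nn.Linear(hidden_dim, num_tags)     # object/predicate tag logits
        self.tag_loss_weight = tag_loss_weight
        self.ce = nn.CrossEntropyLoss(ignore_index=0)       # index 0 assumed to be padding

    def forward(self, hidden_states, word_targets, tag_targets):
        # hidden_states: (batch, seq_len, hidden_dim) from the caption decoder
        word_logits = self.word_head(hidden_states)         # (batch, seq_len, vocab_size)
        tag_logits = self.tag_head(hidden_states)           # (batch, seq_len, num_tags)
        word_loss = self.ce(word_logits.flatten(0, 1), word_targets.flatten())
        tag_loss = self.ce(tag_logits.flatten(0, 1), tag_targets.flatten())
        return word_loss + self.tag_loss_weight * tag_loss
```

Because both heads share the decoder states, the tag supervision acts as a regularizer on the caption decoder rather than as a separate model.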

Highlights

  • Generating a short description for a given image, a problem known as image captioning (Chen et al., 2015), has drawn extensive attention in both the natural language processing and computer vision communities

  • Existing approaches obtain visual relationship graphs using models pretrained on visual relationship detection (VRD) datasets, e.g., Visual Genome (Krishna et al., 2017), where the visual relationships capture semantics between pairs of localized objects connected by predicates, including both spatial and non-spatial semantic relationships (Lu et al., 2016)

  • Image captioning: a prevalent paradigm of existing image captioning methods is the encoder-decoder framework, which often utilizes a convolutional neural network (CNN) plus recurrent neural network (RNN) architecture for image encoding and text generation (Donahue et al., 2015; Vinyals et al., 2015; Karpathy and Fei-Fei, 2015); a minimal sketch of this paradigm follows this list
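For reference, below is a minimal, illustrative CNN-plus-RNN encoder-decoder captioner in PyTorch (module choices and hyperparameters are assumptions for exposition, not the architecture of any specific cited paper): a pretrained CNN encodes the image into a feature vector that is given to an LSTM decoder as its first input, followed by the embedded caption tokens.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative CNN-plus-RNN encoder-decoder captioner (a sketch of the general
# paradigm only; names and sizes are assumptions).
class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)  # project the image feature
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (batch, 3, H, W); captions: (batch, seq_len) token ids
        feats = self.encoder(images).flatten(1)                   # (batch, 2048)
        img_token = self.img_proj(feats).unsqueeze(1)             # (batch, 1, embed_dim)
        word_embeds = self.embed(captions)                        # (batch, seq_len, embed_dim)
        inputs = torch.cat([img_token, word_embeds], dim=1)       # image feature as the first step
        hidden, _ = self.decoder(inputs)
        return self.word_head(hidden[:, 1:])                      # logits for each caption position
```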


Summary

Introduction

Generating a short description for a given image, a problem known as image captioning (Chen et al., 2015), has drawn extensive attention in both the natural language processing and computer vision communities. Once the visual relationship graphs (VRG) are built, we adapt graph convolution operations (Marcheggiani and Titov, 2017) to obtain representations for object nodes and predicate nodes; these nodes can be viewed as the image representation units used for generation. Our models consist of three major components: constructing caption-guided visual relationship graphs (CGVRG) with weakly supervised multi-instance learning, building context-aware CGVRG, and performing multi-task generation. Unlike existing models, this multi-task learning regularizes the network to take explicit object/predicate constraints into account in the process of generation.
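To make the adapted graph convolution concrete, the following is a minimal sketch of one layer over a visual relationship graph (a simplified illustration with assumed tensor shapes and parameter names; the operations adapted from Marcheggiani and Titov (2017) additionally distinguish edge directions and labels): each (subject, predicate, object) triple sends messages to its subject node, object node, and predicate node.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of one graph-convolution layer over a visual relationship
# graph. Object nodes and predicate nodes are updated from the triples they
# participate in; all names are illustrative assumptions.
class VRGConvLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_subj = nn.Linear(dim, dim)          # message: predicate -> its subject node
        self.w_obj = nn.Linear(dim, dim)           # message: predicate -> its object node
        self.w_pred = nn.Linear(2 * dim, dim)      # message: (subject, object) pair -> predicate node
        self.w_self_obj = nn.Linear(dim, dim)      # self-loop for object nodes
        self.w_self_pred = nn.Linear(dim, dim)     # self-loop for predicate nodes
        self.act = nn.ReLU()

    def forward(self, obj_feats, pred_feats, triples):
        # obj_feats: (num_objects, dim); pred_feats: (num_predicates, dim)
        # triples: (num_triples, 3) long tensor of (subject_idx, predicate_idx, object_idx)
        subj, pred, obj = triples[:, 0], triples[:, 1], triples[:, 2]
        new_obj = self.w_self_obj(obj_feats)
        new_obj = new_obj.index_add(0, subj, self.w_subj(pred_feats[pred]))
        new_obj = new_obj.index_add(0, obj, self.w_obj(pred_feats[pred]))
        pair = torch.cat([obj_feats[subj], obj_feats[obj]], dim=-1)
        new_pred = self.w_self_pred(pred_feats).index_add(0, pred, self.w_pred(pair))
        return self.act(new_obj), self.act(new_pred)
```

Stacking a few such layers lets each node absorb information from its graph neighbourhood, yielding context-aware object and predicate representations that serve as the image representation units for generation.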

Related Work
The Models
Extracting Visual Relationship Triples and Detecting Objects
Constructing CGVRG
Context-Aware CGVRG
Multi-task Caption Generation
Multi-task Learning
Training and Inference
Datasets and Experiment Setup
Quantitative Analysis
Qualitative Analysis
Findings
Conclusions