Abstract

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision communities. In this paper, we present a novel image captioning architecture that better exploits the semantics available in captions and leverages them to enhance both image representation and caption generation. Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning. The representation is then enhanced with the textual and visual features of neighbouring and contextual nodes. During generation, the model further incorporates visual relationships through multi-task learning that jointly predicts word and object/predicate tag sequences. We perform extensive experiments on the MSCOCO dataset, showing that the proposed framework significantly outperforms the baselines and achieves state-of-the-art performance under a wide range of evaluation metrics. The code of our paper has been made publicly available.
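As a rough illustration of the multi-task generation step described above (a minimal PyTorch-style sketch with assumed module names, shapes, and loss weighting, not the authors' released code), the decoder's hidden states can feed two classification heads whose cross-entropy losses are combined:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the same decoder hidden states drive a word head and an
# object/predicate tag head; their cross-entropy losses are summed with a
# trade-off weight. All names and shapes are illustrative assumptions.
class MultiTaskHeads(nn.Module):
    def __init__(self, hidden_dim, vocab_size, num_tags, tag_loss_weight=0.5):
        super().__init__()
        self.word_head = nn.Linear(hidden_dim, vocab_size)  # next-word logits
        self.tag_head = nn.Linear(hidden_dim, num_tags)     # object/predicate tag logits
        self.tag_loss_weight = tag_loss_weight
        self.ce = nn.CrossEntropyLoss(ignore_index=0)       # index 0 assumed to be padding

    def forward(self, hidden_states, word_targets, tag_targets):
        # hidden_states: (batch, seq_len, hidden_dim) from the caption decoder
        word_logits = self.word_head(hidden_states)         # (batch, seq_len, vocab_size)
        tag_logits = self.tag_head(hidden_states)           # (batch, seq_len, num_tags)
        word_loss = self.ce(word_logits.flatten(0, 1), word_targets.flatten())
        tag_loss = self.ce(tag_logits.flatten(0, 1), tag_targets.flatten())
        return word_loss + self.tag_loss_weight * tag_loss
```

Because both heads share the decoder states, the tag supervision acts as a regularizer on the caption decoder rather than as a separate model.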

Highlights

  • Generating a short description for a given image, a problem known as image captioning (Chen et al., 2015), has drawn extensive attention in both the natural language processing and computer vision communities

  • Existing approaches obtain visual relationship graphs using models pretrained on visual relationship detection (VRD) datasets, e.g., Visual Genome (Krishna et al., 2017), where the visual relationships capture semantics between pairs of localized objects connected by predicates, including both spatial and non-spatial semantic relationships (Lu et al., 2016)

  • Image captioning: a prevalent paradigm of existing image captioning methods is the encoder-decoder framework, which often utilizes a convolutional neural network (CNN) plus recurrent neural network (RNN) architecture for image encoding and text generation (Donahue et al., 2015; Vinyals et al., 2015; Karpathy and Fei-Fei, 2015); a minimal sketch of this paradigm follows this list
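For reference, below is a minimal, illustrative CNN-plus-RNN encoder-decoder captioner in PyTorch (module choices and hyperparameters are assumptions for exposition, not the architecture of any specific cited paper): a pretrained CNN encodes the image into a feature vector that is given to an LSTM decoder as its first input, followed by the embedded caption tokens.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Illustrative CNN-plus-RNN encoder-decoder captioner (a sketch of the general
# paradigm only; names and sizes are assumptions).
class SimpleCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)  # project the image feature
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.word_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (batch, 3, H, W); captions: (batch, seq_len) token ids
        feats = self.encoder(images).flatten(1)                   # (batch, 2048)
        img_token = self.img_proj(feats).unsqueeze(1)             # (batch, 1, embed_dim)
        word_embeds = self.embed(captions)                        # (batch, seq_len, embed_dim)
        inputs = torch.cat([img_token, word_embeds], dim=1)       # image feature as the first step
        hidden, _ = self.decoder(inputs)
        return self.word_head(hidden[:, 1:])                      # logits for each caption position
```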


Summary

Introduction

Generating a short description for a given image, a problem known as image captioning (Chen et al., 2015), has drawn extensive attention in both the natural language processing and computer vision communities. Once the visual relationship graphs (VRG) are built, we adapt graph convolution operations (Marcheggiani and Titov, 2017) to obtain representations for object nodes and predicate nodes; these nodes can be viewed as the image representation units used for generation. Our models consist of three major components: constructing caption-guided visual relationship graphs (CGVRG) with weakly supervised multi-instance learning, building context-aware CGVRG, and performing multi-task generation. Unlike existing models, this multi-task learning regularizes the network to take explicit object/predicate constraints into account in the process of generation.
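To make the adapted graph convolution concrete, the following is a minimal sketch of one layer over a visual relationship graph (a simplified illustration with assumed tensor shapes and parameter names; the operations adapted from Marcheggiani and Titov (2017) additionally distinguish edge directions and labels): each (subject, predicate, object) triple sends messages to its subject node, object node, and predicate node.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of one graph-convolution layer over a visual relationship
# graph. Object nodes and predicate nodes are updated from the triples they
# participate in; all names are illustrative assumptions.
class VRGConvLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w_subj = nn.Linear(dim, dim)          # message: predicate -> its subject node
        self.w_obj = nn.Linear(dim, dim)           # message: predicate -> its object node
        self.w_pred = nn.Linear(2 * dim, dim)      # message: (subject, object) pair -> predicate node
        self.w_self_obj = nn.Linear(dim, dim)      # self-loop for object nodes
        self.w_self_pred = nn.Linear(dim, dim)     # self-loop for predicate nodes
        self.act = nn.ReLU()

    def forward(self, obj_feats, pred_feats, triples):
        # obj_feats: (num_objects, dim); pred_feats: (num_predicates, dim)
        # triples: (num_triples, 3) long tensor of (subject_idx, predicate_idx, object_idx)
        subj, pred, obj = triples[:, 0], triples[:, 1], triples[:, 2]
        new_obj = self.w_self_obj(obj_feats)
        new_obj = new_obj.index_add(0, subj, self.w_subj(pred_feats[pred]))
        new_obj = new_obj.index_add(0, obj, self.w_obj(pred_feats[pred]))
        pair = torch.cat([obj_feats[subj], obj_feats[obj]], dim=-1)
        new_pred = self.w_self_pred(pred_feats).index_add(0, pred, self.w_pred(pair))
        return self.act(new_obj), self.act(new_pred)
```

Stacking a few such layers lets each node absorb information from its graph neighbourhood, yielding context-aware object and predicate representations that serve as the image representation units for generation.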

Related Work
The Models
Extracting Visual Relationship Triples and Detecting Objects
Constructing CGVRG
Context-Aware CGVRG
Multi-task Caption Generation
Multi-task Learning
Training and Inference
Datasets and Experiment Setup
Quantitative Analysis
Qualitative Analysis
Findings
Conclusions