Abstract

To explore specific visual aspects and language consistency at the same time, this paper introduces a new image captioning task, dubbed entity slot filling captioning (ESFCap). It is similar to the masked entity completion tasks in NLP, which are widely used to study language context and have been successfully employed to improve language understanding. Specifically, given a sentence with a blank that describes an image, the ESFCap task aims to fill the blank with appropriate text according to the visual information. The filled text should be grounded to the correct visual entities and also consistent with the sentence structure. To support ESFCap research, we collect and release an entity slot filling captioning dataset, Flickr30k-EnFi, based on Flickr30k-Entities. The Flickr30k-EnFi dataset consists of 31,783 images and 565,750 masked sentences, as well as the text snippets for the masked slots. To tackle the ESFCap task, we propose a multi-modal fusion model equipped with a novel adaptive dynamic attention module, termed AdaMFN. The AdaMFN model effectively leverages both global and local information from vision and language, and it adaptively focuses on key linguistic knowledge and visual regions to generate correct filling results. The experimental results and analysis demonstrate the effectiveness of our proposed model.
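To make the task format concrete, the sketch below shows what an ESFCap instance and model interface might look like. The field names, example values, and function signature are illustrative assumptions for exposition only, not the released Flickr30k-EnFi schema or the AdaMFN implementation.

```python
# Hypothetical ESFCap instance: an image paired with a masked caption and
# the ground-truth text snippet for the masked slot. All values below are
# made up for illustration; they are not taken from Flickr30k-EnFi.
example = {
    "image_path": "images/example.jpg",                       # assumed field name
    "masked_sentence": "A [MASK] is playing with a dog in the park.",
    "slot_answer": "young boy",                               # snippet for the slot
}

def fill_slot(image, masked_sentence: str) -> str:
    """Placeholder for an ESFCap model (e.g., AdaMFN as described above):
    given the image and the masked sentence, return a text snippet that is
    grounded in the correct visual entity and fits the sentence structure."""
    raise NotImplementedError

# Expected behavior of a trained model:
# fill_slot(load_image(example["image_path"]), example["masked_sentence"])
# -> "young boy"
```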
