Abstract

To explore specific visual aspects and language consistency at the same time, this paper introduces a new image captioning task, dubbed entity slot filling captioning (ESFCap). It is similar to the masked entity completion tasks in NLP, which are widely used to study language context and have been successfully employed to improve language understanding. Specifically, given a sentence with a blank that describes an image, the ESFCap task aims to fill the blank with appropriate text according to the visual information. The filled text should be grounded to the correct visual entities and also consistent with the sentence structure. To support ESFCap research, we collect and release an entity slot filling captioning dataset, Flickr30k-EnFi, based on Flickr30k-Entities. The Flickr30k-EnFi dataset consists of 31,783 images and 565,750 masked sentences, as well as the text snippets for the masked slots. To tackle the ESFCap task, we propose a multi-modal fusion model equipped with a novel adaptive dynamic attention module, termed AdaMFN. The AdaMFN model effectively leverages both global and local information from vision and language, and it adaptively focuses on key linguistic knowledge and visual regions to generate correct filling results. The experimental results and analysis demonstrate the effectiveness of our proposed model.
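To make the task format concrete, the sketch below shows what an ESFCap instance and model interface might look like. The field names, example values, and function signature are illustrative assumptions for exposition only, not the released Flickr30k-EnFi schema or the AdaMFN implementation.

```python
# Hypothetical ESFCap instance: an image paired with a masked caption and
# the ground-truth text snippet for the masked slot. All values below are
# made up for illustration; they are not taken from Flickr30k-EnFi.
example = {
    "image_path": "images/example.jpg",                       # assumed field name
    "masked_sentence": "A [MASK] is playing with a dog in the park.",
    "slot_answer": "young boy",                               # snippet for the slot
}

def fill_slot(image, masked_sentence: str) -> str:
    """Placeholder for an ESFCap model (e.g., AdaMFN as described above):
    given the image and the masked sentence, return a text snippet that is
    grounded in the correct visual entity and fits the sentence structure."""
    raise NotImplementedError

# Expected behavior of a trained model:
# fill_slot(load_image(example["image_path"]), example["masked_sentence"])
# -> "young boy"
```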
