Fully-attentive iterative networks for region-based controllable image and video captioning

Marcella Cornia, Lorenzo Baraldi, Ayellet Tal, Rita Cucchiara

doi:10.1016/j.cviu.2023.103857

Abstract

Controllable image captioning has recently gained attention as a way to increase the diversity of image captioning algorithms and their applicability to real-world scenarios. In this task, a captioner is conditioned on an external control signal, which must be followed during the generation of the caption. We aim to overcome the limitations of current controllable captioning methods by proposing a fully-attentive and iterative network that can generate grounded and controllable captions from a control signal given as a sequence of visual regions from the image. Our architecture is based on a set of novel attention operators, which take into account the hierarchical nature of the control signal, and is endowed with a decoder that explicitly focuses on each part of the control signal. We demonstrate the effectiveness of the proposed approach by conducting experiments on three datasets, where our model surpasses the performance of previous methods and achieves a new state of the art on both image and video controllable captioning.

Journal: Computer Vision and Image Understanding
Publication Date: Oct 5, 2023
Citations: 1
License type: cc-by

Similar Papers

Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap
Soheyla Amirian ... Hamid R Arabnia
IEEE Access | VOL. 8
01 Jan 2020

Visual saliency for image captioning in new multimedia services
Marcella Cornia ... Rita Cucchiara
01 Jul 2017

Towards Unified Deep Learning Model for NSFW Image and Video Captioning
Jong-Won Ko ... Dong-Hyun Hwang
29 Nov 2018

Normalized and Geometry-Aware Self-Attention Network for Image Captioning
Longteng Guo ... Peng Yao
01 Jun 2020
