Joint Commonsense and Relation Reasoning for Image and Video Captioning

Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yunde Jia, Yayun Qi, Jiebo Luo

doi:10.1609/aaai.v34i07.6731

Abstract

Exploiting relationships between objects for image and video captioning has received increasing attention. Most existing methods depend heavily on pre-trained detectors of objects and their relationships, and thus may not work well when facing detection challenges such as heavy occlusion, tiny objects, and long-tail classes. In this paper, we propose a joint commonsense and relation reasoning method that exploits prior knowledge for image and video captioning without relying on any detectors. The prior knowledge provides semantic correlations and constraints between objects, serving as guidance to build semantic graphs that summarize object relationships, some of which cannot be directly perceived from images or videos. Specifically, our method is implemented as an iterative learning algorithm that alternates between 1) commonsense reasoning for embedding visual regions into the semantic space to build a semantic graph and 2) relation reasoning for encoding semantic graphs to generate sentences. Experiments on several benchmark datasets validate the effectiveness of our prior knowledge-based approach.
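
The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, and the paper's algorithm alternates learning steps, whereas the code below uses no training at all. Every name here (`build_semantic_graph`, `propagate`, `alternate`, the two-dimensional features, and the `prior` table of concept correlations) is a hypothetical stand-in chosen for illustration. Step 1 labels each region with its nearest concept and links regions whose concepts are correlated in the prior knowledge; step 2 mixes features along those links.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_semantic_graph(regions, concepts, prior):
    """Step 1 (stand-in for commonsense reasoning): label each region with
    its closest concept, then connect regions whose concepts are related
    according to the hypothetical `prior` table."""
    labels = [max(concepts, key=lambda c: cosine(r, concepts[c])) for r in regions]
    edges = {}
    for i, a in enumerate(labels):
        for j, b in enumerate(labels):
            weight = prior.get((a, b), 0.0)
            if i != j and weight > 0.0:
                edges[(i, j)] = weight
    return labels, edges

def propagate(regions, edges):
    """Step 2 (stand-in for relation reasoning): one round of weighted
    neighbor averaging over the semantic graph."""
    updated = []
    for i, r in enumerate(regions):
        total, acc = 1.0, list(r)
        for (src, dst), w in edges.items():
            if dst == i:
                total += w
                acc = [x + w * y for x, y in zip(acc, regions[src])]
        updated.append([x / total for x in acc])
    return updated

def alternate(regions, concepts, prior, rounds=3):
    """Alternate between graph building and feature propagation."""
    labels, edges = [], {}
    for _ in range(rounds):
        labels, edges = build_semantic_graph(regions, concepts, prior)
        regions = propagate(regions, edges)
    return labels, edges, regions

# Hypothetical toy data: two concepts, two region features.
concepts = {"person": [1.0, 0.0], "horse": [0.0, 1.0]}
prior = {("person", "horse"): 0.5, ("horse", "person"): 0.5}
labels, edges, features = alternate([[0.9, 0.1], [0.1, 0.9]], concepts, prior)
```

In this sketch the graph is rebuilt from the updated features on every round, which mirrors the idea that the two reasoning steps inform each other; a captioning decoder that would consume the encoded graph is omitted.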

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Apr 3, 2020
Citations: 30

Similar Papers

Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap
Soheyla Amirian ... Hamid R Arabnia
IEEE Access | VOL. 8
01 Jan 2020

Towards Unified Deep Learning Model for NSFW Image and Video Captioning
Jong-Won Ko ... Dong-Hyun Hwang
29 Nov 2018

Fully-attentive iterative networks for region-based controllable image and video captioning
Marcella Cornia ... Rita Cucchiara
Computer Vision and Image Understanding | VOL. 237
05 Oct 2023

A grey relational analysis based evaluation metric for image captioning and video captioning
Miao Ma ... Bolong Wang
01 Aug 2017
