Abstract

Conventional video captioning methods are either stage-wise or purely end-to-end. While the former may introduce additional noise by relying on off-the-shelf models for extra information, the latter suffers from a lack of high-level cues. A more desirable framework should therefore capture multiple aspects of videos consistently. To this end, we present a concept-aware and task-specific model named CAT that accounts for both low-level visual and high-level concept cues and incorporates them effectively in an end-to-end manner. Specifically, low-level visual and high-level concept features are obtained from CAT's video transformer and concept parser, respectively, and a concept loss is introduced to regularize the learning of the concept parser with respect to generated pseudo ground truth. To combine the multi-level features, a caption transformer is then introduced in CAT, which takes the visual and concept features as input and produces the caption as output. In particular, we make critical design choices in the caption transformer so that it learns to exploit these cues through a multi-modal graph. This is achieved by a graph loss that enforces effective learning of intra- and inter-modal correlations between multi-level cues. Extensive experiments on three benchmark datasets demonstrate that CAT achieves CIDEr improvements of 2.3 on MSVD and 0.7 on MSR-VTT over the state-of-the-art method SwinBERT [1], and also achieves competitive results on VATEX.
