Abstract

Video captioning is a joint task of computer vision and natural language processing, which aims to describe the video content using several natural language sentences. Nowadays, most methods cast this task as a mapping problem, learning a mapping from visual features to natural language and generating captions directly from videos. However, the underlying challenge of video captioning, i.e., sequence-to-sequence mapping across different domains, is still not well handled. To address this problem, we introduce a polishing mechanism that mimics the human polishing process and propose a generate-and-polish framework for video captioning. In this paper, we propose a two-step transformer-based polishing network (TSTPN) consisting of two sub-modules: a generation module that generates a caption candidate and a polishing module that gradually refines the generated candidate. Specifically, the candidate provides global information about the visual content in a semantically meaningful order: first, it serves as a semantic intermediary to bridge the semantic gap between text and video, together with a cross-modal attention mechanism for better cross-modal modeling; second, it provides a global planning ability that maintains the semantic consistency and fluency of the whole sentence for better sequence mapping. In experiments, we present extensive evaluations showing that the proposed TSTPN achieves comparable or even better performance than state-of-the-art methods on benchmark datasets.
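
The abstract only outlines the generate-and-polish idea; the following is a minimal sketch of such a two-stage pipeline, not the authors' TSTPN implementation. It assumes a PyTorch setup in which a generation decoder produces a draft caption by attending over projected video features, and a polishing decoder re-decodes while attending to both the video features and the embedded draft. All module names, dimensions, and the choice to concatenate the draft embeddings into the polishing memory are illustrative assumptions.

# Minimal generate-and-polish sketch (illustrative, not the paper's released code).
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Transformer decoder that attends over a memory sequence (cross-modal attention)."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, memory):
        # tokens: (B, T) caption token ids; memory: (B, S, d_model) visual/draft features
        x = self.embed(tokens)
        t = tokens.size(1)
        causal_mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal_mask)
        return self.out(h)

class GenerateAndPolish(nn.Module):
    """Two-step sketch: the generation module drafts a caption from video features;
    the polishing module re-decodes while attending to the video features and the draft."""
    def __init__(self, vocab_size, feat_dim=2048, d_model=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)   # project frame features to model dim
        self.generator = CaptionDecoder(vocab_size, d_model)
        self.polisher = CaptionDecoder(vocab_size, d_model)

    def forward(self, video_feats, draft_tokens, polish_tokens):
        memory = self.proj(video_feats)                       # (B, S, d_model)
        draft_logits = self.generator(draft_tokens, memory)   # first pass: draft caption
        draft_emb = self.generator.embed(draft_tokens)        # embed the draft candidate
        # Polishing pass attends to video features AND the draft candidate.
        polish_memory = torch.cat([memory, draft_emb], dim=1)
        polish_logits = self.polisher(polish_tokens, polish_memory)
        return draft_logits, polish_logits

if __name__ == "__main__":
    model = GenerateAndPolish(vocab_size=10000)
    feats = torch.randn(2, 20, 2048)           # 20 frame features per video (assumed)
    draft = torch.randint(0, 10000, (2, 12))   # teacher-forced tokens for both passes
    d_logits, p_logits = model(feats, draft, draft)
    print(d_logits.shape, p_logits.shape)      # (2, 12, 10000) each

In this sketch both passes are trained with teacher forcing; at inference the draft would be decoded greedily (or by beam search) and then fed to the polishing decoder, mirroring the generate-then-refine order described above.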
