A Fine-Grained Spatial-Temporal Attention Model for Video Captioning

An-An Liu, Mohan Kankanhalli, Yu-Ting Su, Yurui Qiu, Yongkang Wong

doi:10.1109/access.2018.2879642

Open Access

https://doi.org/10.1109/access.2018.2879642

Journal: IEEE Access	Publication Date: Jan 1, 2018
Citations: 56	License type: cc-by-nc-nd

Affiliation: Tianjin University, National University of Singapore

Abstract

Attention mechanisms have been extensively used in video captioning, enabling deeper visual understanding. However, most existing video captioning methods apply attention at the frame level, which models only the temporal structure and the generated words and ignores the region-level spatial information that provides accurate visual features corresponding to the semantic content. In this paper, we propose a fine-grained spatial-temporal attention model (FSTA) whose main concern is the spatial information of objects appearing in the video. In the proposed FSTA, we achieve spatial hard attention at the fine-grained region level of objects through a mask pooling module, and compute temporal soft attention by using a two-layer LSTM network with an attention mechanism to generate sentences. We test the proposed model on two benchmark datasets, MSVD and MSR-VTT. The results indicate that the proposed FSTA model achieves competitive performance against state-of-the-art methods on both datasets.
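
The two attention steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not specify the mask pooling formula or the attention scoring function, so this sketch assumes mean pooling over a binary object mask for the spatial hard attention and a plain dot-product score for the temporal soft attention. The function names `mask_pool` and `temporal_soft_attention`, and the `query` vector standing in for an LSTM hidden state, are hypothetical.

```python
import math

def mask_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is 1.

    feature_map: H x W grid of C-dimensional feature vectors (nested lists).
    mask: H x W grid of 0/1 values marking one object's region (assumed input).
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, keep in zip(feat_row, mask_row):
            if keep:  # hard attention: a position is either kept or dropped
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total  # empty mask: return a zero vector
    return [v / count for v in total]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]

def temporal_soft_attention(frame_feats, query):
    """Weight per-frame features by a softmax over dot-product scores.

    Dot-product scoring is an assumption made for this sketch; `query`
    stands in for the decoder LSTM's hidden state.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in frame_feats]
    weights = softmax(scores)
    dim = len(frame_feats[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, frame_feats))
               for d in range(dim)]
    return weights, context

# Toy 2 x 2 feature map with 2 channels; the mask keeps the two diagonal cells.
feature_map = [[[1.0, 0.0], [3.0, 2.0]],
               [[5.0, 4.0], [7.0, 6.0]]]
mask = [[1, 0],
        [0, 1]]
region_feature = mask_pool(feature_map, mask)  # [4.0, 3.0]
```

In this sketch the mask makes a binary keep-or-drop decision per spatial position, whereas the temporal step assigns every frame a nonzero weight that sums to one across frames.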


Similar Papers

KSF-ST: Video Captioning Based on Key Semantic Frames Extraction and Spatio-Temporal Attention Mechanism
Zhaowei Qu ... Xiaoru Wang
01 Jun 2020

Recovering the spectral and spatial information of an object behind a scattering media
Lei Zhu ... Tengfei Wu
OSA Continuum | VOL. 1
01 Oct 2018

An attention based dual learning approach for video captioning
Wanting Ji ... Xun Wang
Applied Soft Computing | VOL. 117
21 Dec 2021

Motion Guided Spatial Attention for Video Captioning
Shaoxiang Chen ... Yu-Gang Jiang
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 33
17 Jul 2019
