Video Summarization With Spatiotemporal Vision Transformer.

Tzu-Chun Hsu, Chun-Rong Huang, Yi-Sheng Liao

doi:10.1109/tip.2023.3275069

Abstract

Video summarization aims to generate a compact summary of the original video for efficient video browsing. To provide video summaries that are consistent with human perception and contain important content, supervised learning-based video summarization methods have been proposed. These methods aim to learn important content based on continuous frame information of human-created summaries. However, recent methods rarely consider, at the same time, both the inter-frame correlations among non-adjacent frames and the intra-frame attention that attracts humans when representing frame importance. To address these issues, we propose a novel transformer-based method named spatiotemporal vision transformer (STVT) for video summarization. The STVT is composed of three main components: the embedded sequence module, the temporal inter-frame attention (TIA) encoder, and the spatial intra-frame attention (SIA) encoder. The embedded sequence module generates the embedded sequence by fusing the frame embedding, index embedding, and segment class embedding to represent the frames. The temporal inter-frame correlations among non-adjacent frames are learned by the TIA encoder with the multi-head self-attention scheme. Then, the spatial intra-frame attention of each frame is learned by the SIA encoder. Finally, a multi-frame loss is computed to drive the learning of the network in an end-to-end trainable manner. By simultaneously using both inter-frame and intra-frame information, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.
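To make the embedded sequence and the temporal inter-frame attention more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. Several details are assumptions that the abstract does not specify: the class embedding is prepended as an extra token (as in the original vision transformer), fusion with the index embedding is done by element-wise addition, and the attention uses a single head with identity query/key/value projections instead of learned multi-head weights. The function names `build_embedded_sequence` and `self_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_embedded_sequence(frame_emb, index_emb, class_emb):
    """Fuse frame, index, and class embeddings into one token sequence.

    Assumption: the class embedding is prepended as token 0 and the
    index embeddings are added element-wise to every token.
    """
    if len(index_emb) != len(frame_emb) + 1:
        raise ValueError("need one index embedding per token (frames + class)")
    tokens = [class_emb] + frame_emb
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, index_emb)]


def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (no learned weights), so the
    attention weights depend only on the similarity between tokens.
    Every token attends to every other token, adjacent or not.
    """
    d = len(seq[0])
    scale = math.sqrt(d)
    weights = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in seq]
        weights.append(softmax(scores))
    out = [[sum(w * v[j] for w, v in zip(row, seq)) for j in range(d)]
           for row in weights]
    return out, weights


# Toy example: three 2-D frame embeddings; frames 0 and 2 are identical
# but not adjacent in time.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
index_emb = [[0.0, 0.0] for _ in range(4)]  # zeros keep the toy case readable
class_emb = [0.0, 0.0]

seq = build_embedded_sequence(frames, index_emb, class_emb)
out, weights = self_attention(seq)

# Row 1 holds the attention weights of frame 0 over [class, f0, f1, f2].
print([round(w, 3) for w in weights[1]])
```

In this toy case, frame 0 assigns the same weight to the non-adjacent frame 2 as to itself, and a smaller weight to the adjacent but dissimilar frame 1, which is the kind of non-adjacent correlation the TIA encoder is described as learning. A spatial counterpart in the spirit of the SIA encoder would apply the same attention over patch tokens within a single frame; that step is not sketched here.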

Journal: IEEE Transactions on Image Processing
Publication Date: Jan 1, 2023
Citations: 16

Similar Papers

Unsupervised Video Summarization With Cycle-Consistent Adversarial LSTM Networks
Li Yuan ... Jiashi Feng
IEEE Transactions on Multimedia | VOL. 22
24 Sep 2020

Attention Over Attention: An Enhanced Supervised Video Summarization Approach
Isha Puthige ... Mohit Agarwal
Procedia Computer Science | VOL. 218
01 Jan 2023

Joint Video Summarization and Moment Localization by Cross-Task Sample Transfer
Hao Jiang ... Yadong Mu
01 Jun 2022

Query-controllable Video Summarization
Jia-Hong Huang ... Marcel Worring
08 Jun 2020
