Hierarchical multimodal transformer to summarize videos

Bin Zhao, Maoguo Gong, Xuelong Li

doi:10.1016/j.neucom.2021.10.039

Open Access

https://doi.org/10.1016/j.neucom.2021.10.039

Abstract

Although video summarization has achieved tremendous success with Recurrent Neural Networks (RNNs), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits their performance. The transformer is an effective model for this problem and surpasses RNN-based methods in several sequence modeling tasks, such as machine translation and video captioning. Motivated by the success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization; it captures the dependencies among frames and shots and summarizes the video by exploiting the scene information formed by shots. Furthermore, we argue that both audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Extensive experiments show that HMT achieves (F-measure: 0.441, Kendall’s τ: 0.079, Spearman’s ρ: 0.080) on SumMe and (F-measure: 0.601, Kendall’s τ: 0.096, Spearman’s ρ: 0.107) on TVSum. It surpasses most traditional, RNN-based, and attention-based video summarization methods.
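
The three metrics reported in the abstract can be illustrated with a minimal sketch. It uses the textbook definitions only: the abstract does not describe the paper's evaluation protocol (for example, how summaries are segmented into shots or how scores are averaged over annotators), so the function names `f_measure`, `kendall_tau`, and `spearman_rho` and the toy inputs are assumptions made for illustration, not the authors' code.

```python
from itertools import combinations


def f_measure(predicted, reference):
    """Harmonic mean of precision and recall over selected frame indices."""
    predicted, reference = set(predicted), set(reference)
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant (an assumption;
    library implementations often use the tie-corrected tau-b instead).
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def _average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = _average_ranks(x), _average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy example: predicted vs. annotated frame importance scores.
predicted_scores = [0.1, 0.4, 0.3, 0.9]
human_scores = [0.2, 0.5, 0.1, 0.8]
tau = kendall_tau(predicted_scores, human_scores)
rho = spearman_rho(predicted_scores, human_scores)
f1 = f_measure([1, 2, 3, 4], [3, 4, 5])
```

F-measure compares which frames were selected, while the two rank correlations compare how the predicted importance scores order the frames relative to human annotations. Because they measure different things, a method can score well on one and poorly on the other.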

Journal: Neurocomputing
Publication Date: Oct 22, 2021
Citations: 45

Similar Papers

Shifted-Window Hierarchical Vision Transformer for Distracted Driver Detection
Hong Vin Koay ... Chee-Onn Chow
23 Aug 2021

Wildlife Video Captioning Based on ResNet and LSTM
Abid Kapadi ... Chinmay Ram Kavimandan
01 Jan 2020

Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning
Amaia Salvador ... Loris Bazzani
01 Jun 2021

Unsupervised video summarization using deep Non-Local video summarization networks
Sha-Sha Zang ... Ru Zeng
Neurocomputing | VOL. 519
12 Nov 2022
