Unsupervised Learning of Visual and Semantic Features for Video Summarization

Yansen Huang, Wenjin Yao, Rui Zhong, Rui Wang

doi:10.1109/iscas51556.2021.9401310

Abstract

High redundancy among keyframes is a critical issue for existing summarization methods when dealing with user-created videos. To address this issue, we present an unsupervised learning method, the Spatial Attention Model guided Bi-directional Long Short-term Memory network (Bi-LSTM), which combines visual and semantic features. For the visual feature, we design a Salient-Area-Size-based spatial attention model, based on the observation that humans tend to focus on sizable and moving objects in videos. The Bi-LSTM network is leveraged to exploit the semantic feature. The Soft Selected Probabilities generated from the spatial attention and the semantic feature are then fused to obtain the final probability for keyframe selection. A reinforcement learning framework, trained with the Deep Deterministic Policy Gradient algorithm, is adopted for unsupervised training. Extensive experiments on the SumMe and TVSum datasets demonstrate that our method outperforms state-of-the-art methods in terms of F-score.

Full Text