Visual and Semantic Feature Coordinated Bi-LSTM Model for Unsupervised Video Summarization

Zhiqiang Hong, Rui Zhong

doi:10.1109/icme51207.2021.9428250

Abstract

When dealing with user-created videos, prior methods suffer from high redundancy among the selected keyframes. To address this issue, we present a Visual and Semantic Feature coordinated Bi-LSTM (VSFB) model for unsupervised video summarization. First, a novel Salient-Area-Size-based spatial attention model is presented to extract frame-wise visual features, based on the observation that humans tend to focus on sizable and moving objects. Second, the visual features are integrated with semantic features processed by a Bi-LSTM to refine each frame's probability of being selected as a keyframe. Finally, an index-adjusted diversity and representativeness reward is used to reinforce the learning of the VSFB model for video summarization. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of F-score.
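To make the reward idea concrete, the following is a minimal sketch of a plain diversity and representativeness reward of the kind often used in unsupervised summarization. It is not the VSFB implementation: the abstract does not specify the index adjustment, so none is applied here, and the function names, the cosine dissimilarity for diversity, the Euclidean distance for representativeness, and the equal weighting of the two terms are all assumptions made for illustration.

```python
import math


def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def diversity_reward(features, selected):
    """Mean pairwise dissimilarity among the selected keyframes."""
    n = len(selected)
    if n < 2:
        return 0.0  # diversity is undefined for fewer than two keyframes
    total = sum(
        cosine_dissimilarity(features[i], features[j])
        for i in selected
        for j in selected
        if i != j
    )
    return total / (n * (n - 1))


def representativeness_reward(features, selected):
    """exp(-mean distance from every frame to its nearest keyframe)."""
    if not selected:
        return 0.0
    total = 0.0
    for frame in features:
        total += min(
            math.sqrt(sum((x - y) ** 2 for x, y in zip(frame, features[j])))
            for j in selected
        )
    return math.exp(-total / len(features))


def summary_reward(features, selected):
    """Unadjusted reward: diversity plus representativeness (assumed equal weights)."""
    return diversity_reward(features, selected) + representativeness_reward(
        features, selected
    )


# Toy example: four frames forming two groups of duplicates (hypothetical data).
toy_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
spread_out = summary_reward(toy_features, [0, 2])  # one keyframe per group
redundant = summary_reward(toy_features, [0, 1])   # two duplicate keyframes
```

In this toy setting, picking one keyframe from each group scores higher than picking two duplicates, which is the behavior a redundancy-reducing reward is meant to encourage.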

Full Text