Abstract
This paper studies the video summarization task by formulating it as a sequential decision-making process, in which the input is a sequence of video frames and the output is a subset of the original frames. Long Short-Term Memory (LSTM) is a commonly used framework in prior video summarization methods due to its strong ability to model temporal dependencies. However, the frame sequences in video summarization are relatively long, whereas LSTM can only handle short video clips of up to 80 frames in length. This paper proposes a novel deep summarization framework, Deep Hierarchical LSTM Networks with Attention for Video Summarization (DHAVS), which consists of fine-grained feature extraction, temporal dependency modeling, and video summary generation. Specifically, we employ a 3D CNN instead of a 2D CNN to extract spatial–temporal features, and we design an attention-based hierarchical LSTM module to capture the temporal dependencies among video frames. Additionally, we treat video summarization as an imbalanced class distribution problem and design a cost-sensitive loss function. Experimental results show that the proposed method achieves improvements of 0.7% ∼ 21.3% and 4.5% ∼ 12.2% over conventional methods on the SumMe and TVSum datasets, respectively.

Highlights
• Proposing a novel hierarchical LSTM network with attention for video summarization.
• Employing 3D ResNeXt-101 to extract a more fine-grained video representation.
• Capturing temporal dependencies by hierarchical LSTM networks with attention.
• Introducing cost-sensitive learning to address the imbalanced class problem.
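To make the cost-sensitive learning idea concrete, the sketch below shows one common way to weight a frame-level key-frame loss in PyTorch: since key frames (positive labels) are much rarer than non-key frames, the positive class is given a larger weight in a binary cross-entropy objective. This is a minimal illustration under that assumption, not the paper's exact loss; the class name `CostSensitiveBCELoss`, the `pos_weight` value, and the tensor shapes are hypothetical.

```python
import torch
import torch.nn as nn


class CostSensitiveBCELoss(nn.Module):
    """Class-weighted binary cross-entropy for frame-level key-frame labels.

    Key frames (label 1) are far rarer than non-key frames, so the positive
    class receives a larger weight, e.g. the negative-to-positive ratio
    observed in the training set.
    """

    def __init__(self, pos_weight: float):
        super().__init__()
        # pos_weight > 1 penalizes missed key frames more heavily than
        # false positives on non-key frames.
        self.loss = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([pos_weight]))

    def forward(self, frame_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # frame_scores: (batch, num_frames) raw importance logits
        # labels:       (batch, num_frames) binary key-frame annotations
        return self.loss(frame_scores, labels.float())


if __name__ == "__main__":
    # Toy usage: one video, 10 frames, ~20% key frames -> pos_weight ≈ 4.
    criterion = CostSensitiveBCELoss(pos_weight=4.0)
    scores = torch.randn(1, 10)
    labels = torch.zeros(1, 10)
    labels[0, [2, 7]] = 1.0
    print(criterion(scores, labels).item())
```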