Video summarization extracts the relevant contents from a video and presents the entire content of the video in a compact and summarized form. User based video summarization, can summarize a video as per the requirement of the user. In this work, a non interactive and a perception-based video summarization technique is proposed that makes use of attention mechanism to capture user’s interest and extract relevant keyshots in temporal sequence from the video content. Here, video summarization has been articulated as a sequence-to-sequence learning problem and a supervised method has been proposed for summarization of the video. Adding layers to the existing network makes it deeper, enables higher level of abstraction and facilitates better feature extraction. Therefore, the proposed model uses a multi-layered, deep summarization encoder-decoder network (MLAVS), with attention mechanism to select final keyshots from the video. The contextual information of the video frames is encoded using a multi-layered Bidirectional Long Short-Term Memory network (BiLSTM) as the encoder. To decode, a multi-layered attention-based Long Short-Term memory (LSTM) using a multiplicative score function is employed. The experiments are performed on the benchmark TVSum dataset and the results obtained are compared with recent works. The results show considerable improvement and clearly demonstrate the efficacy of this methodology against most of the other available state-of-art methods.
Read full abstract