Abstract

With the explosive growth of video captured every day, efficiently extracting useful information from videos has become an increasingly important problem. As one of the most effective approaches, video summarization, which aims to extract the most important frames or shots, has attracted increasing interest. Many current methods employ a recurrent structure; however, their step-by-step nature makes them difficult to parallelize. To address this problem, we propose a dual-path attentive video summarization framework consisting of a temporal-spatial encoder, a score-aware encoder, and a decoder, all built mainly on multi-head self-attention and the convolutional block attention module. The temporal-spatial encoder captures temporal and spatial information, while the score-aware encoder fuses appearance features with previously predicted frame-level importance scores. By combining the scores and appearance features, our model better captures long-range global dependencies and continuously updates the importance scores of previous frames. Moreover, being based entirely on attention mechanisms, our model can be trained fully in parallel, which reduces training time. We validate the method on two popular datasets, SumMe and TVSum, and the experimental results demonstrate its effectiveness.
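The score-aware idea above can be sketched in a few lines: run self-attention over per-frame appearance features that have been fused with previously predicted importance scores. This is a minimal illustration only; the single attention head, the concatenation-based fusion, and all names below are our assumptions, not the paper's exact design.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Scaled dot-product self-attention over frames (single head, no
    # learned projections): every frame attends to every other frame,
    # which is what gives the model its long-range global dependencies.
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x

def score_aware_encode(features, prev_scores):
    # Hypothetical fusion: append the previous frame-level importance
    # score to each frame's appearance feature as one extra channel,
    # then let attention mix scores and appearance jointly.
    fused = np.concatenate([features, prev_scores[:, None]], axis=1)
    return self_attention(fused)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))  # 8 frames, 16-dim appearance features
scores = rng.random(8)                 # previously predicted importance scores
out = score_aware_encode(frames, scores)
print(out.shape)  # (8, 17)
```

Because nothing here is sequential across frames, the whole encoder runs as dense matrix products, which is why such a model trains in full parallel, unlike a step-by-step recurrent network.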
