Abstract

Video summarization remains a challenging task. With the abundance of video data on the Internet, the task has drawn significant attention in the vision community and benefits a wide range of applications, e.g., video retrieval and search. To perform video summarization by deriving keyframes that represent a given input video, we propose a novel framework named Hierarchical Multi-Attention Network (H-MAN), which comprises a shot-level reconstruction model and a multi-head attention model. Our attention model adopts a two-stage hierarchical structure to produce diverse attention maps, and we are among the first to exploit a multi-attention mechanism for video summarization, which brings improved performance. Quantitative and qualitative results demonstrate the effectiveness of our model, which performs favorably against state-of-the-art approaches.
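To make the attention-based keyframe scoring concrete, the following is a minimal sketch, assuming pre-extracted frame features and standard multi-head self-attention; the class and parameter names (FrameAttentionScorer, feat_dim, num_heads) are illustrative assumptions and do not reproduce H-MAN's shot-level reconstruction or its two-stage hierarchy.

```python
# Hypothetical sketch: multi-head self-attention over frame features of a shot,
# followed by a per-frame importance head. Not the authors' implementation.
import torch
import torch.nn as nn


class FrameAttentionScorer(nn.Module):
    """Scores each frame of a shot with multi-head self-attention."""

    def __init__(self, feat_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.score = nn.Linear(feat_dim, 1)  # frame-importance head

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim) pre-extracted CNN features
        attended, _ = self.attn(frames, frames, frames)
        return torch.sigmoid(self.score(attended)).squeeze(-1)  # (batch, num_frames)


if __name__ == "__main__":
    feats = torch.randn(2, 120, 1024)             # 2 shots, 120 frames each (dummy data)
    scores = FrameAttentionScorer()(feats)        # importance score per frame
    keyframes = scores.topk(k=15, dim=1).indices  # select top-scoring frames as keyframes
    print(keyframes.shape)                        # torch.Size([2, 15])
```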
