Abstract

In this paper, we propose a multiscale hierarchical attention approach for supervised video summarization. Unlike most existing supervised methods, which employ bidirectional long short-term memory networks, our method exploits the underlying hierarchical structure of video sequences and learns both short-range and long-range temporal representations via intra-block and inter-block attention. Specifically, we first divide each video sequence into blocks of equal length and apply intra-block attention to learn local information and inter-block attention to learn global information. Then, we integrate the frame-level, block-level, and video-level representations to predict frame-level importance scores. Next, we conduct shot segmentation and compute shot-level importance scores. Finally, we perform key shot selection to produce video summaries. Moreover, we extend our method into a two-stream framework that leverages both appearance and motion information. Experimental results on the SumMe and TVSum datasets validate the effectiveness of our method against state-of-the-art methods.
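As an illustration only, the block-wise pipeline described above can be sketched in NumPy. This is not the paper's implementation. The helper names (`self_attention`, `hierarchical_scores`), the attention without learned projections, the mean pooling, the last-frame padding, and the random linear scoring weights are all assumptions made for the example. Shot segmentation, key shot selection, and the two-stream extension are not covered.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    # Scaled dot-product self-attention over the second-to-last axis.
    # Assumption: no learned query/key/value projections, to keep the sketch short.
    d = x.shape[-1]
    weights = softmax(x @ np.swapaxes(x, -1, -2) / np.sqrt(d), axis=-1)
    return weights @ x


def hierarchical_scores(frames, block_len, w, b=0.0):
    # frames: (n_frames, d) feature matrix; w: (3 * d,) hypothetical scoring weights.
    n, d = frames.shape

    # Pad by repeating the last frame so the sequence splits into equal blocks
    # (the padding strategy is an assumption of this sketch).
    pad = (-n) % block_len
    if pad:
        frames = np.concatenate([frames, np.repeat(frames[-1:], pad, axis=0)])
    blocks = frames.reshape(-1, block_len, d)      # (n_blocks, block_len, d)

    # Intra-block attention: each frame attends only to frames in its own block.
    local = self_attention(blocks)                 # (n_blocks, block_len, d)

    # Inter-block attention: pooled block vectors attend to one another.
    block_repr = self_attention(local.mean(axis=1))  # (n_blocks, d)

    # Video-level representation: mean over blocks (assumed pooling).
    video_repr = block_repr.mean(axis=0)           # (d,)

    # Fuse frame-, block-, and video-level representations for every frame.
    frame_level = local.reshape(-1, d)
    block_level = np.repeat(block_repr, block_len, axis=0)
    video_level = np.broadcast_to(video_repr, frame_level.shape)
    fused = np.concatenate([frame_level, block_level, video_level], axis=1)[:n]

    # Linear layer followed by a sigmoid gives one importance score per frame.
    return 1.0 / (1.0 + np.exp(-(fused @ w + b)))


# Toy usage with random features: 10 frames, 8-dimensional, blocks of 4 frames.
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))
w = 0.1 * rng.normal(size=24)
scores = hierarchical_scores(frames, block_len=4, w=w)
print(scores.shape)
```

In this sketch the cost of intra-block attention grows with the block length rather than the full sequence length, while inter-block attention operates on one vector per block, which is one way to see how short-range and long-range dependencies can be handled at different scales.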
