Abstract

Video summarization compresses a long video into a set of keyframes or a short clip, letting users grasp the important content quickly. Generic video summarization, however, produces a single summary for each video and ignores user subjectivity. Query-focused video summarization addresses this by bringing user subjectivity, expressed as a user query, into the summarization process: it jointly considers the importance of the selected frames/shots and their relevance to the query. For query-focused video summarization, we propose a Regression Augmented Global Attention Network (RAGAN), which consists mainly of a global attention module and a query-aware regression module. The global attention module borrows the key idea of multiple computational steps (termed “hops”) from memory networks and progressively refines the global attention information across hops. The query-aware regression module consists of a skip connection, a multi-modal feature fusion module, and a two-layer fully connected network. The skip connection combines information from low and high layers. The multi-modal feature fusion module fuses visual and textual features from three perspectives: the original information of each modality, the additive interaction between the two modalities, and the multiplicative interaction between the two modalities. The two-layer fully connected network regresses the fused features to importance scores, from which the final video summary is obtained. Extensive experiments on the query-focused video summarization dataset demonstrate the effectiveness of the proposed RAGAN model.
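
The fusion and regression steps described in the abstract can be illustrated with a minimal Python sketch. It is not the RAGAN implementation: the abstract does not say how the three perspectives are combined, which activations are used, or how the features are sized, so the concatenation, the ReLU and sigmoid activations, the equal feature dimensions, and every name here (`fuse`, `score_shot`, `init_params`) are assumptions made for illustration only.

```python
import math
import random


def fuse(visual, textual):
    """Fuse two equal-length feature vectors from three perspectives.

    Assumption: the perspectives are combined by concatenation, giving
    [visual, textual, visual + textual, visual * textual].
    """
    if len(visual) != len(textual):
        raise ValueError("visual and textual features must have equal length")
    added = [v + q for v, q in zip(visual, textual)]       # additive interaction
    multiplied = [v * q for v, q in zip(visual, textual)]  # multiplicative interaction
    return list(visual) + list(textual) + added + multiplied


def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_params(dim, hidden_dim, seed=0):
    """Random weights for a hypothetical two-layer regressor."""
    rng = random.Random(seed)
    fused_dim = 4 * dim  # four concatenated blocks of size `dim`
    return {
        "w1": [[rng.uniform(-0.1, 0.1) for _ in range(fused_dim)]
               for _ in range(hidden_dim)],
        "b1": [0.0] * hidden_dim,
        "w2": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]],
        "b2": [0.0],
    }


def score_shot(visual, textual, params):
    """Regress fused features to one importance score in (0, 1).

    Assumption: ReLU after the first layer and a sigmoid on the output.
    """
    fused = fuse(visual, textual)
    hidden = [max(0.0, h) for h in linear(fused, params["w1"], params["b1"])]
    (logit,) = linear(hidden, params["w2"], params["b2"])
    return 1.0 / (1.0 + math.exp(-logit))


params = init_params(dim=2, hidden_dim=3)
score = score_shot([1.0, 2.0], [3.0, 4.0], params)
```

In this sketch a summary would be formed by scoring every shot against the query and keeping the highest-scoring ones. The selection rule is likewise an assumption, since the abstract does not describe it.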
