Abstract

High redundancy among keyframes is a critical issue for existing summarization methods when dealing with user-created videos. To address it, we present an unsupervised learning method, a Spatial Attention Model guided Bi-directional Long Short-Term Memory (Bi-LSTM) network, that combines visual and semantic features. For the visual feature, we design a Salient-Area-Size-based spatial attention model, based on the observation that humans tend to focus on sizable and moving objects in videos. The Bi-LSTM network is leveraged to exploit the semantic feature. The Soft Selected Probabilities generated from the spatial attention and the semantic feature are then fused to obtain the final probability for keyframe selection. A reinforcement learning framework, trained with the Deep Deterministic Policy Gradient algorithm, is adopted for unsupervised training. Extensive experiments on the SumMe and TVSum datasets demonstrate that our method outperforms state-of-the-art methods in terms of F-score.
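The fusion and evaluation steps mentioned above can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the convex combination below, the weight `alpha`, the selection `threshold`, and all function names are assumptions made for illustration only, not the authors' method. The F-score here is the standard harmonic mean of precision and recall computed over frame indices, which simplifies the overlap-based protocols typically used on summarization benchmarks.

```python
def fuse_probabilities(p_attention, p_semantic, alpha=0.5):
    """Fuse two per-frame probability lists.

    Assumption: a convex combination with weight `alpha`; the abstract
    does not specify how the two probabilities are actually fused.
    """
    if len(p_attention) != len(p_semantic):
        raise ValueError("per-frame probability lists must have equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * s
            for a, s in zip(p_attention, p_semantic)]


def select_keyframes(probs, threshold=0.5):
    """Return indices of frames whose fused probability reaches the threshold.

    Assumption: a fixed threshold stands in for whatever selection
    policy the trained model uses.
    """
    return [i for i, p in enumerate(probs) if p >= threshold]


def f_score(selected, reference):
    """Harmonic mean of precision and recall over frame indices."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical per-frame probabilities for a four-frame clip.
    p_att = [0.9, 0.2, 0.6, 0.1]
    p_sem = [0.7, 0.4, 0.6, 0.3]
    keyframes = select_keyframes(fuse_probabilities(p_att, p_sem))
    print(keyframes, f_score(keyframes, reference=[0, 3]))
```

With equal weighting, a frame is selected only when the average of its two probabilities reaches the threshold, so a frame that scores highly on one cue but poorly on the other can still be dropped.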
