Abstract

Video summarization is a key technique for video retrieval, browsing, and management. It remains a challenging research task due to user subjectivity, excessive redundant information, and insufficient modeling of spatio-temporal dependencies. In this paper, we propose an unsupervised video summarization approach based on reinforcement learning with shot-level semantics. The method follows an encoder-decoder design. A convolutional neural network, trained on a novel field-size dataset, serves as the encoder and extracts a convolutional feature matrix from the video; a bidirectional LSTM then serves as the decoder, producing probability weights for keyframe selection while preserving the spatio-temporal dependencies of the video. To reduce the influence of user subjectivity, we design a shot-level semantic reward function that yields more representative summaries: shot-level semantics reflect the rules followed during the video shooting process and do not change with the preferences of individual viewers. Finally, we evaluate our approach on four classical datasets, SumMe, TVSum, CoSum, and VTW. The results show that our algorithm outperforms competing methods and achieves satisfactory performance.
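As a rough illustration of the decoder stage described above, the following PyTorch sketch shows a bidirectional LSTM mapping per-frame CNN features to keyframe-selection probabilities, with Bernoulli sampling as a hook for a policy-gradient (REINFORCE-style) update. The feature dimension, hidden size, and sampling scheme are assumptions for illustration, not the paper's exact configuration or reward function.

```python
import torch
import torch.nn as nn

class SummarizerDecoder(nn.Module):
    """Hypothetical sketch: BiLSTM decoder over per-frame CNN features
    that outputs a keyframe-selection probability for each frame."""
    def __init__(self, feat_dim=1024, hidden_dim=256):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden_dim, 1), nn.Sigmoid())

    def forward(self, feats):            # feats: (batch, n_frames, feat_dim)
        h, _ = self.bilstm(feats)        # (batch, n_frames, 2 * hidden_dim)
        return self.head(h).squeeze(-1)  # per-frame probabilities in (0, 1)

# Usage with dummy encoder output (1024-D features are an assumption):
feats = torch.randn(1, 320, 1024)        # 320 frames of a toy video
probs = SummarizerDecoder()(feats)       # (1, 320) keyframe probabilities
actions = torch.bernoulli(probs)         # sample a candidate summary
# A reward on the sampled summary -- e.g. the paper's shot-level semantic
# reward -- would then drive the policy-gradient update, roughly:
# loss = -reward * (log-probability of the sampled actions).
```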
