Abstract

In this manuscript, the authors present a keyshot-based supervised video summarization method that combines feature fusion with LSTM networks. The framework has three main components: 1) Video summarization is formulated as a sequence-to-sequence problem, in which importance scores for video content are predicted from the video feature sequence. 2) Visual and textual features are considered jointly: the authors present a deep fusion of multimodal features and summarize videos with a recurrent encoder-decoder architecture built on bi-directional LSTMs. 3) Most importantly, to train the supervised framework, the authors use the number of users who selected each video clip for their final summary as the ground-truth importance score. Comparisons are performed against state-of-the-art methods and against different variants of FLSum and T-FLSum. The F-scores and rank correlation coefficients on TVSum and SumMe show the strong performance of the proposed method.
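The manuscript itself includes no code, but the fusion-and-scoring pipeline described in points 1) and 2) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration: the feature dimensions, the additive fusion, and the class name FusionBiLSTMSummarizer are hypothetical stand-ins, not the authors' FLSum/T-FLSum implementation.

```python
import torch
import torch.nn as nn

class FusionBiLSTMSummarizer(nn.Module):
    """Hypothetical sketch (not the authors' FLSum/T-FLSum code):
    fuse visual and textual features, then predict one importance
    score per video segment with a bi-LSTM encoder-decoder."""

    def __init__(self, visual_dim=1024, text_dim=300, hidden_dim=256):
        super().__init__()
        # Project both modalities into a shared space before fusing.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Bi-directional LSTM encoder over the fused feature sequence.
        self.encoder = nn.LSTM(hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Decoder LSTM reads the encoder states; a linear head emits
        # a scalar importance score for every segment.
        self.decoder = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.score_head = nn.Linear(hidden_dim, 1)

    def forward(self, visual_feats, text_feats):
        # visual_feats: (batch, seq_len, visual_dim)
        # text_feats:   (batch, seq_len, text_dim)
        fused = torch.tanh(self.visual_proj(visual_feats)
                           + self.text_proj(text_feats))  # additive fusion (an assumption)
        enc_out, _ = self.encoder(fused)    # (batch, seq_len, 2*hidden_dim)
        dec_out, _ = self.decoder(enc_out)  # (batch, seq_len, hidden_dim)
        return torch.sigmoid(self.score_head(dec_out)).squeeze(-1)  # scores in [0, 1]

# Toy usage: 2 videos, 40 segments each.
model = FusionBiLSTMSummarizer()
scores = model(torch.randn(2, 40, 1024), torch.randn(2, 40, 300))  # shape (2, 40)
```

A single decoder LSTM over the bi-directional encoder states is only one plausible reading of "recurrent encoder-decoder architecture"; the actual method may use attention or a different decoding scheme.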
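Point 3)'s ground-truth construction and the reported rank correlation metrics can likewise be sketched. Assuming the per-clip score is simply the fraction of annotators who kept the clip in their summary (the normalization is an assumption), and using Kendall's tau and Spearman's rho from SciPy as the rank correlation coefficients the abstract mentions:

```python
import numpy as np
from scipy.stats import kendalltau, spearmanr

def importance_from_user_selections(selections):
    """selections: (n_users, n_segments) binary matrix, 1 where a user
    kept the segment in their final summary. The ground-truth score is
    taken as the fraction of users who selected each segment (this
    normalization is an assumption)."""
    selections = np.asarray(selections, dtype=float)
    return selections.sum(axis=0) / selections.shape[0]

# Toy example: 3 annotators, 5 segments.
user_choices = [[1, 0, 1, 0, 0],
                [1, 1, 1, 0, 0],
                [0, 0, 1, 0, 1]]
gt = importance_from_user_selections(user_choices)  # [0.67, 0.33, 1.0, 0.0, 0.33]

predicted = np.array([0.7, 0.2, 0.9, 0.1, 0.4])     # made-up model outputs
tau, _ = kendalltau(predicted, gt)
rho, _ = spearmanr(predicted, gt)
print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")
```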
