A movie summarization model automatically produces a condensed version of a movie by selecting keyframes. Previous works have proposed movie summarizers based on traditional methods or, more recently, neural networks, and have made notable progress. Despite these successes, several limitations remain: (1) previous works mainly resort to hand-crafted heuristics, and most of them are unsupervised; (2) no suitable public dataset is currently available for supervised movie summarization; (3) existing works focus only on the movies themselves while neglecting the audience, who ultimately decide which parts of a movie are most attractive. To address these limitations, we establish a movie summarization dataset, Movie50, and propose a novel annotation pipeline based on human attention. Furthermore, we propose A/V-MSNet, an audiovisual neural network that exploits spatio-temporal visual and auditory information to better simulate human attention and draw on richer cues. The network is trained end-to-end and evaluated on both a public dataset and our own. Extensive experiments demonstrate the superiority of the proposed method.
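To make the audiovisual scoring idea concrete, the sketch below shows one plausible shape such a model could take. It is not the paper's actual A/V-MSNet: the module names, feature dimensions, and GRU-based temporal context are all illustrative assumptions, standing in for whatever visual and auditory backbones and fusion scheme the paper uses.

```python
# Minimal sketch of audiovisual keyframe scoring (hypothetical, not A/V-MSNet):
# fuse per-shot visual and audio features, add temporal context, and predict
# an importance score for each shot. All dimensions are illustrative.
import torch
import torch.nn as nn

class AVScorer(nn.Module):
    """Scores each movie shot by fusing visual and auditory embeddings."""

    def __init__(self, vis_dim=2048, aud_dim=128, hidden=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)   # project visual features
        self.aud_proj = nn.Linear(aud_dim, hidden)   # project audio features
        # bidirectional GRU models temporal context across the shot sequence
        self.temporal = nn.GRU(hidden * 2, hidden, batch_first=True,
                               bidirectional=True)
        self.head = nn.Linear(hidden * 2, 1)         # per-shot importance score

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, n_shots, vis_dim); aud_feats: (batch, n_shots, aud_dim)
        fused = torch.cat([self.vis_proj(vis_feats),
                           self.aud_proj(aud_feats)], dim=-1)
        ctx, _ = self.temporal(fused)
        return torch.sigmoid(self.head(ctx)).squeeze(-1)  # (batch, n_shots)

# Example: score 100 shots of one movie using random stand-in features,
# then keep the 10 highest-scoring shots as the summary.
model = AVScorer()
scores = model(torch.randn(1, 100, 2048), torch.randn(1, 100, 128))
keyframes = scores.topk(10, dim=1).indices
```

A supervised variant of this setup could be trained with a per-shot loss against human-attention-derived importance labels, which is the kind of annotation the Movie50 pipeline is described as providing.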