Abstract
A movie summarization model automatically produces a condensed version of a movie by selecting keyframes. Previous works have proposed movie summarizers based on traditional methods or, more recently, neural networks, and have made progress. Despite these successes, several limitations remain: (1) prior works rely mainly on hand-crafted heuristics, and most of them are unsupervised; (2) no suitable public dataset is currently available for supervised movie summarization; (3) existing works focus only on the movies themselves and neglect the audience, who ultimately decide which parts of a movie are most attractive. To overcome these limitations, we establish a movie summarization dataset, Movie50, and propose a novel human-attention-based annotation pipeline. Furthermore, we propose A/V-MSNet, an audiovisual neural network that exploits spatio-temporal visual and auditory information to better simulate human attention and to draw on richer cues. The network is trained end-to-end and evaluated on both a public dataset and our dataset. Extensive experiments demonstrate the superiority of the proposed method.
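To make the audiovisual-fusion idea concrete, below is a minimal, hypothetical sketch of how per-shot visual and audio features could be fused and scored for keyframe selection. This is not the authors' A/V-MSNet; the class name, feature dimensions, and layer choices are all assumptions for illustration only.

```python
# Hypothetical sketch of audiovisual fusion for shot-importance scoring.
# NOT the paper's A/V-MSNet: all names and dimensions are assumptions.
import torch
import torch.nn as nn

class AudioVisualScorer(nn.Module):
    """Fuses per-shot visual and audio features, then predicts an
    importance score per shot; keyframes come from top-scoring shots."""
    def __init__(self, vis_dim=1024, aud_dim=128, hidden=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)   # project visual features
        self.aud_proj = nn.Linear(aud_dim, hidden)   # project audio features
        self.temporal = nn.GRU(hidden, hidden, batch_first=True,
                               bidirectional=True)   # model shot ordering
        self.head = nn.Linear(2 * hidden, 1)         # scalar importance score

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, shots, vis_dim); aud_feats: (batch, shots, aud_dim)
        fused = torch.relu(self.vis_proj(vis_feats) + self.aud_proj(aud_feats))
        ctx, _ = self.temporal(fused)                # (batch, shots, 2*hidden)
        return torch.sigmoid(self.head(ctx)).squeeze(-1)  # (batch, shots)

# Usage: score 120 shots of one movie using random stand-in features.
model = AudioVisualScorer()
scores = model(torch.randn(1, 120, 1024), torch.randn(1, 120, 128))
keyframes = scores[0].topk(k=18).indices  # e.g. keep the top ~15% of shots
```

Under a supervised setup like the one the abstract describes, such a scorer could be trained end-to-end against human-attention-derived importance labels (e.g. with a binary cross-entropy loss per shot); the actual architecture and training objective of A/V-MSNet are detailed in the paper itself.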