Abstract

Video summarization aims to condense video content by extracting pivotal frames or shots. Most existing methods focus on maximizing the overlap between the predicted summary and the ground truth, overlooking whether users can infer the content of the original video from the summary. In addition, these approaches rely heavily on annotated data, which limits their applicability. We therefore propose a reconstructive network under contrastive graph rewards for video summarization, comprising a summary generator and a video reconstructor. The summary generator employs graph contrastive learning to distill essential video information and generate the summary. The video reconstructor, in turn, uses reinforcement learning within an unsupervised training framework to optimize the summary generator, addressing the shortage of annotated video data in summarization tasks. By leveraging a reconstruction loss, our approach ensures that the predicted summary captures the main video content and inter-shot dependencies. Notably, we devise a mutual information maximization reconstruction reward that preserves the information shared between the summary and the original video, helping users comprehend the original video content from the summary. We conduct extensive experiments on the TVSum and SumMe datasets, where our network achieves F1 scores of 58.8% and 48.0%, respectively. The results validate the superiority of our method over state-of-the-art unsupervised methods and many supervised video summarization techniques.
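
To illustrate the idea of a mutual-information-maximization reconstruction reward, the sketch below shows one plausible formulation under assumptions not stated in the abstract: it uses PyTorch and an InfoNCE-style lower bound on the mutual information between original frame features and features reconstructed from the predicted summary. All names, dimensions, and the temperature parameter are illustrative, not taken from the paper.

```python
# Minimal sketch (PyTorch assumed): an InfoNCE-style lower bound on the mutual
# information between original frame features and frame features reconstructed
# from the summary, used as a reward signal for the summary generator.
import torch
import torch.nn.functional as F

def mi_reconstruction_reward(original_feats: torch.Tensor,
                             reconstructed_feats: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """original_feats, reconstructed_feats: (num_frames, feat_dim)."""
    # L2-normalize so dot products become cosine similarities.
    orig = F.normalize(original_feats, dim=-1)
    recon = F.normalize(reconstructed_feats, dim=-1)

    # Pairwise similarity between every original frame and every reconstruction.
    logits = orig @ recon.t() / temperature          # (N, N)

    # InfoNCE: each original frame should match its own reconstruction
    # (the diagonal) rather than reconstructions of other frames.
    targets = torch.arange(orig.size(0), device=orig.device)
    mi_lower_bound = -F.cross_entropy(logits, targets)

    # Higher reward when the summary preserves more shared information
    # with the original video.
    return mi_lower_bound
```

In a reinforcement learning loop of the kind the abstract describes, this reward would be computed from the reconstructor's output and combined with any other rewards before updating the summary generator; the exact reward composition in the paper may differ.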
