Abstract

Query-conditioned video summarization requires (1) finding a diverse set of video shots/frames that are representative of the whole video, and (2) ensuring that the selected shots/frames are related to a given query. The summary can thus be tailored to different user interests, yielding a more personalized result, and the task differs from generic video summarization, which focuses only on video content. Our work targets this query-conditioned video summarization task. We first propose a Mapping Network (MapNet) to express how related a shot is to a given query. MapNet establishes the relation between the two modalities (video and query), allowing visual information to be mapped into the query space. We then develop a deep reinforcement learning-based summarization network (SummNet) that produces personalized summaries by integrating relatedness, representativeness and diversity rewards. These rewards jointly guide the agent to select the most representative and diverse video shots that are most related to the user query. Experimental results on a query-conditioned video summarization benchmark demonstrate the effectiveness of the proposed method, indicating the usefulness of both the proposed mapping mechanism and the reinforcement learning approach.
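The abstract describes three reward terms that jointly guide the agent: relatedness to the query, representativeness of the whole video, and diversity among selected shots. The paper's exact reward formulas are not given here; the following is a minimal sketch under assumed definitions (cosine-similarity features, a hypothetical `query_sim` vector standing in for MapNet's output, and a simple sum of the three terms):

```python
import numpy as np

def summarization_rewards(features, selected, query_sim):
    """Sketch of the three reward terms described in the abstract.

    features:  (n_shots, d) array of shot features, assumed L2-normalized.
    selected:  list of indices of shots chosen by the agent.
    query_sim: (n_shots,) similarity of each shot to the query, e.g. as
               produced by a MapNet-style mapping (hypothetical here).
    """
    sel = features[selected]

    # Relatedness: mean query similarity of the selected shots.
    r_rel = float(np.mean(query_sim[selected]))

    # Representativeness: how well the selection covers the whole video,
    # via the mean distance from every shot to its nearest selected shot.
    dists = np.linalg.norm(features[:, None, :] - sel[None, :, :], axis=-1)
    r_rep = float(np.exp(-dists.min(axis=1).mean()))

    # Diversity: one minus the mean pairwise similarity among selections.
    if len(selected) > 1:
        sims = sel @ sel.T
        n = len(selected)
        mean_off_diag = (sims.sum() - np.trace(sims)) / (n * (n - 1))
        r_div = 1.0 - float(mean_off_diag)
    else:
        r_div = 0.0

    # The combination (here an unweighted sum) is also an assumption.
    return r_rel + r_rep + r_div
```

In a policy-gradient setup, this scalar reward would be computed once per generated summary and used to reinforce the agent's selection actions.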

Highlights

  • It is estimated that by 2021 it will take a person about 5 million years to watch all the videos that are uploaded each month [1]

  • Different from traditional video summarization which only focuses on video content, query-conditioned video summarization is tasked to generate user-oriented summaries conditioned on a given query in the form of text [4,12,13,14,15]

  • We introduce a notion of relatedness that expresses how related a video shot is to a given query by proposing a mapping network (MapNet) that maps video shots to query space
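The highlights describe MapNet as mapping video shots into the query space so that relatedness can be scored there. The actual MapNet architecture is not specified in this excerpt; below is a minimal sketch of the idea, where a random linear projection stands in for trained weights and all dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 512-d visual features, 300-d query (text) space.
d_vis, d_query = 512, 300

# A learned visual-to-query projection; random weights stand in for
# whatever MapNet actually learns.
W = rng.normal(scale=0.01, size=(d_vis, d_query))

def map_to_query_space(shot_features):
    """Project visual shot features into the query space (sketch)."""
    projected = shot_features @ W
    # L2-normalize so cosine similarity reduces to a dot product.
    return projected / np.linalg.norm(projected, axis=-1, keepdims=True)

shots = rng.normal(size=(10, d_vis))       # 10 video shots
query = rng.normal(size=(d_query,))        # query embedding
query = query / np.linalg.norm(query)

relatedness = map_to_query_space(shots) @ query  # one score per shot
```

Once both modalities live in the same space, per-shot relatedness is just a cosine similarity, which is what a downstream relatedness reward can consume.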


Summary

Introduction

It is estimated that by 2021 it will take a person about 5 million years to watch all the videos that are uploaded each month [1]. Video summarization helps to address this problem: it aims to automatically select a small subset of frames/shots that captures the most interesting parts of a video in a concise manner [2,3,4,5,6,7]. It reduces the time and cost required to analyze video information, provides significant advantages for efficient video browsing, searching and understanding, and further enhances various downstream applications such as video-based question answering [8], robot learning [9] and surveillance data analysis [10,11].


