Abstract

Video scene detection is the task of temporally segmenting a video into its basic story units, called scenes. We propose a temporal-context-aware scene detection method. For each shot in a video, we store the time-indexed features of its surrounding shots as its context memory. A context-reading operation retrieves the most relevant information from the memory, which is then used to update the query shot's feature. To adaptively determine the temporal scale of the context memory for different queries, we apply a bank of context memories at different temporal scales to generate multiple context reads, and aggregate them adaptively according to their confidence scores. The adaptive context reading is guided by a structure-learning objective that encourages each shot to read the most appropriate context, so that the global structure of scenes is revealed in the feature space. With the context-aware shot features learned by our method, we perform clustering to find the scene boundaries. Our experiments demonstrate that adaptively modeling temporal context yields state-of-the-art results on existing video scene detection datasets. We also construct a large-scale dataset for the task, and our ablation studies on it show that the performance gains stem from the proposed adaptive context reading.
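To make the mechanism described above concrete, here is a minimal sketch of a multi-scale context read with confidence-weighted aggregation, assuming shot features are given as a PyTorch tensor. The scale set, the dot-product attention read, and the similarity-based fallback confidence score are illustrative stand-ins, not the paper's exact learned components (`conf_head` is a hypothetical learned scorer).

```python
import torch
import torch.nn.functional as F

def read_context(query, memory):
    """Attention-style read: weight memory shots by similarity to the query."""
    # query: (d,), memory: (m, d) time-indexed features of surrounding shots
    attn = F.softmax(memory @ query / query.shape[0] ** 0.5, dim=0)  # (m,)
    return attn @ memory  # (d,) context read

def adaptive_context_read(shots, idx, scales=(2, 4, 8), conf_head=None):
    """For shot `idx`, read from context memories at several temporal scales
    and aggregate the reads according to their confidence scores."""
    query = shots[idx]
    reads, confs = [], []
    for s in scales:
        lo, hi = max(0, idx - s), min(len(shots), idx + s + 1)
        memory = shots[lo:hi]  # context memory at temporal scale s
        read = read_context(query, memory)
        reads.append(read)
        # conf_head would be a learned scorer; a similarity proxy otherwise
        score = conf_head(read) if conf_head else torch.dot(read, query)
        confs.append(score)
    reads = torch.stack(reads)                      # (num_scales, d)
    weights = F.softmax(torch.stack(confs), dim=0)  # adaptive aggregation
    context = weights @ reads
    return query + context  # updated, context-aware shot feature
```

The context-aware features produced this way would then be clustered along the timeline, with cluster transitions taken as scene boundaries.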
