Abstract
Sampling a small set of relevant frames is crucial for efficient video classification. Existing methods mainly rely on hand-designed sampling strategies or learn sequential selection policies, and each approach leaves a challenge unsolved. First, hand-designed sampling strategies are intrinsically non-adaptive to different video backbones. Second, sequential frame selection policies ignore temporal relations among all video frames, and the sequential selection process also hinders the application of these samplers in speed-critical systems. In this article, we propose a differentiable parallel video sampling network (PSN) to tackle these challenges. First, we optimize the video sampler with a differentiable surrogate loss, which allows the sampler to be learned dynamically in cooperation with the video classification model. Our sampler considers the feedback from all frames jointly, eliminating the learning difficulties of sequential decision making. The learning process is fully gradient-based, so the sampler can be learned efficiently. Our video sampler can assess a set of frames swiftly and determine the importance of each frame in parallel. Second, we propose to model the inter-relation among contextual frames, which encourages the sampler to select frames based on a comprehensive inspection of the entire video. We observe that even a simple instantiation of context relation mining significantly improves classification performance. The experimental results on three standard video recognition benchmarks demonstrate the efficacy and efficiency of our framework.
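To make the idea of parallel, context-aware frame scoring concrete, the following is a minimal NumPy sketch, not the PSN architecture itself. It assumes per-frame feature vectors are already available, uses a single dot-product attention step as a stand-in for context relation mining, and uses a softmax over frame logits as a stand-in for a differentiable selection signal. The names `score_frames_in_parallel`, `select_top_k`, `w_score`, and `temperature` are hypothetical and do not come from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def score_frames_in_parallel(frame_feats, w_score, temperature=1.0):
    """Return one soft importance weight per frame, for all frames at once.

    frame_feats: (T, D) array of per-frame features (assumed precomputed).
    w_score:     (2 * D,) scoring vector (hypothetical learnable parameter).
    """
    _, dim = frame_feats.shape
    # Assumed context step: every frame attends to every frame in the video.
    attn = softmax(frame_feats @ frame_feats.T / np.sqrt(dim), axis=-1)
    context = attn @ frame_feats  # (T, D) summary of the other frames
    # Score each frame from its own features plus its context summary.
    logits = np.concatenate([frame_feats, context], axis=1) @ w_score
    # Soft weights are smooth in w_score, so gradients could flow through them.
    return softmax(logits / temperature)


def select_top_k(weights, k):
    """Hard selection for inference: the k highest-weight frames, in time order."""
    top = np.argsort(weights)[::-1][:k]
    return np.sort(top)


rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))   # 16 frames, 8-dim features (toy data)
w = rng.normal(size=(16,))         # matches 2 * D = 16
weights = score_frames_in_parallel(feats, w)
print(select_top_k(weights, k=4))
```

In this sketch all frames are scored in one pass of matrix operations, with no step-by-step decision loop, which is the property the abstract contrasts with sequential selection policies. How the surrogate loss couples these weights to the classifier is not specified in the abstract and is left out here.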