Abstract

Recent progress in video action recognition has been driven by the great success of convolutional neural networks (ConvNets). Many studies have investigated modeling spatiotemporal architectures with ConvNets to take full advantage of video information. However, the quality of the video representations that serve as network inputs is often insufficient due to significant background motion. In this paper, we propose the Improved Dynamic Image (IDI), which describes videos by applying salient object detection and rank pooling to a sequence of still images. Saliency detection segments the objects of interest, which can be viewed as image-level background-motion removal, while rank pooling is used to construct dynamic images as input to ConvNets. We also present a temporal Residual Network (ResNet) architecture that operates directly on multiple IDIs to learn long-term video representations. Experiments on two standard action recognition benchmarks demonstrate that our method achieves state-of-the-art performance without combining video-level representations (e.g., dense optical flow) or hand-designed features (e.g., improved dense trajectories).
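As background for the rank-pooling step mentioned above, the sketch below shows the standard *approximate* rank pooling commonly used to build a dynamic image: each frame is weighted by a closed-form coefficient (earlier frames negative, later frames positive) and the weighted frames are summed into a single image that encodes the temporal evolution of appearance. This is an illustrative baseline under the usual formulation, not the authors' exact IDI pipeline, which additionally applies salient object detection before pooling.

```python
import numpy as np

def approximate_rank_pooling(frames):
    """Collapse a (T, H, W, C) frame stack into one dynamic image.

    Uses the simplified approximate rank-pooling weights
    alpha_t = 2t - T - 1 (for t = 1..T), so early frames receive
    negative weight and late frames positive weight; the weighted
    sum summarizes the video's temporal dynamics in a single image.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T = frames.shape[0]
    t = np.arange(1, T + 1)
    alpha = 2.0 * t - T - 1.0                   # e.g. T = 3 -> [-2, 0, 2]
    # Weighted sum over the time axis.
    return np.tensordot(alpha, frames, axes=1)

# Toy usage: 5 random "frames" of size 4x4 with 3 channels.
video = np.random.rand(5, 4, 4, 3)
dyn = approximate_rank_pooling(video)           # shape (4, 4, 3)
```

Note that the weights sum to zero for any T, so a static video (identical frames) yields an all-zero dynamic image; only appearance *change* over time survives the pooling.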
