Abstract

Given a video clip that contains only one type of action (e.g., golfing), the goal of action recognition is to identify that action's category from a given set of action types. To deliver a fast response in practical video applications, existing works have focused on processing only the leading frames of the input video. In our view, only the informative key frames extracted from this ‘partial video’ should be used for the action recognition task. This not only speeds up recognition further, because less data has to be processed, but also achieves higher recognition accuracy, because more distinctive features are presented to the learning network. To this end, this paper proposes a novel two-stage learning network architecture that consists of a selection network (S-net) and a recognition network (R-net). The S-net is a relatively shallow network designed to efficiently identify informative key frames, while the R-net is a deep network that performs the final action recognition. Within the S-net, a key frame selection criterion is further proposed for identifying the informative key frames. Extensive experiments on two benchmark datasets, UCF101 and HMDB51, clearly show that our approach significantly outperforms existing state-of-the-art methods.
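
The two-stage flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the S-net's selection criterion or either network's layers, so `score_fn` and `classify_fn` are hypothetical stand-ins for the S-net and R-net, and the top-k rule, the `observed_ratio` parameter, and the toy `spread` score are assumptions made only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # assumption: a frame reduced to a flat feature vector


def select_key_frames(
    frames: Sequence[Frame],
    score_fn: Callable[[Frame], float],
    k: int,
) -> List[int]:
    """Stand-in for the S-net: indices of the k highest-scoring frames,
    returned in temporal order. Top-k is an assumed criterion."""
    if k <= 0:
        return []
    ranked = sorted(range(len(frames)), key=lambda i: score_fn(frames[i]), reverse=True)
    return sorted(ranked[:k])


def recognize(
    frames: Sequence[Frame],
    observed_ratio: float,
    score_fn: Callable[[Frame], float],
    classify_fn: Callable[[List[Frame]], str],
    k: int,
) -> Tuple[str, List[int]]:
    """Keep only the leading frames (the 'partial video'), select key frames
    from them, and pass those frames to the recognizer (stand-in for the R-net)."""
    n_observed = max(1, int(len(frames) * observed_ratio))
    partial = frames[:n_observed]
    key_idx = select_key_frames(partial, score_fn, k)
    label = classify_fn([partial[i] for i in key_idx])
    return label, key_idx


# Toy stand-ins, purely hypothetical.
def spread(frame: Frame) -> float:
    # Treat the value range of a frame as its "informativeness".
    return max(frame) - min(frame)


def toy_classifier(key_frames: List[Frame]) -> str:
    mean_spread = sum(spread(f) for f in key_frames) / len(key_frames)
    return "class_a" if mean_spread > 0.5 else "class_b"


video = [
    [0.0, 0.1], [0.1, 0.9], [0.5, 0.5], [0.2, 0.8],  # observed half
    [0.0, 1.0], [0.3, 0.3], [0.4, 0.6], [0.9, 0.1],  # never seen by the model
]
label, key_idx = recognize(video, observed_ratio=0.5, score_fn=spread,
                           classify_fn=toy_classifier, k=2)
print(label, key_idx)  # prints: class_a [1, 3]
```

In this sketch the recognizer sees only 2 of the 8 frames, which mirrors the abstract's argument that selecting key frames from the partial video reduces the amount of data the deeper network has to process.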

