Abstract

A key factor that distinguishes action detection in videos from general video classification is the presence of human-guided clues, especially motion signals. Since not all pixels in a video are informative for action recognition, the irrelevant and redundant regions introduce considerable noise and burden both feature extraction and classifier training. This motivates researchers to design attentive models that dynamically focus computation on the key spatiotemporal volumes. In this paper, we propose a motion-centric attention model for action detection in videos that imitates the saccade and fixation procedures of human perception. Specifically, we first present a strategy to generate motion-centric locations based on the density peaks of motion signals, providing reliable candidates around which actions are likely to occur. We then introduce an attention model that conducts saccade and fixation over these candidates to observe local spatiotemporal visual information, preserve an internal comprehension of the video, and produce action proposals with temporal bounds. Afterward, a classifier with several variants classifies the action proposals and decides which one to fixate on, generating the final predictions. We show how to efficiently train our model to produce fast and accurate action detection by scanning only a small fraction of locations in a video. Extensive experiments on three challenging datasets show promising results in terms of both accuracy and speed.
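
For intuition, the sketch below shows one plausible way to obtain motion-centric candidate locations as density peaks of accumulated optical-flow magnitude over a short clip. The function name, parameters, and choice of Farneback optical flow are illustrative assumptions for this sketch, not the authors' exact procedure.

```python
# Illustrative sketch (assumed details, not the paper's exact algorithm):
# candidate locations are taken as density peaks of accumulated motion magnitude.
import numpy as np
import cv2
from scipy.ndimage import gaussian_filter, maximum_filter

def motion_centric_candidates(frames, sigma=9.0, num_peaks=5):
    """frames: list of grayscale uint8 frames from one video clip."""
    h, w = frames[0].shape
    motion = np.zeros((h, w), dtype=np.float32)
    for prev, nxt in zip(frames[:-1], frames[1:]):
        # Dense optical flow between consecutive frames (Farneback, assumed here).
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        motion += np.linalg.norm(flow, axis=2)   # accumulate motion magnitude
    density = gaussian_filter(motion, sigma)      # smooth into a density map
    # Density peaks: local maxima of the smoothed motion map.
    peaks = (density == maximum_filter(density, size=25)) & (density > 0)
    ys, xs = np.nonzero(peaks)
    order = np.argsort(density[ys, xs])[::-1][:num_peaks]
    return [(int(xs[i]), int(ys[i])) for i in order]  # (x, y) candidates
```

In such a scheme, the attention model would only need to visit these few candidate locations rather than the full frame, which is consistent with the abstract's claim of scanning only a small fraction of locations per video.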
