Abstract

Two-stream frameworks that combine spatial information with optical flow have achieved strong performance on action recognition in video. Optical flow captures low-level motion characteristics from a fixed number of consecutive video frames, but it contains noise and cannot adequately characterize actions that vary in posture and duration. Typically, the ten frames before and after a given frame are used to compute optical flow, a window that may be too long or too short to capture the useful motion features of different actions. Moreover, computing optical flow from several consecutive video frames is expensive. To address these issues, we propose a novel framework that recognizes actions by capturing a high-level motion feature, human pose estimation, instead of optical flow. Our framework uses 2D human pose estimation as the motion feature and fuses it with the spatial information using attention mechanisms. We conduct extensive experiments on two challenging datasets of realistic human actions, HMDB-51 and UCF-101. The experimental results show that our two-stream framework outperforms state-of-the-art approaches in terms of accuracy.
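
The abstract does not specify the form of the attention mechanism, so the following is only an illustrative sketch of how two stream features could be fused with attention weights. The function `attention_fuse`, the `query` vector, and the three-dimensional toy features are all hypothetical and are not taken from the paper; in a real model the query would be learned and the features would come from the spatial and pose networks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(spatial, pose, query):
    """Fuse two equal-length feature vectors with attention weights.

    Assumption: each stream is scored by its dot product with a
    (hypothetical, normally learned) query vector; the scores are
    normalized with softmax and used to average the two streams.
    """
    streams = [spatial, pose]
    scores = [sum(q * x for q, x in zip(query, feat)) for feat in streams]
    weights = softmax(scores)
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, streams))
        for i in range(len(spatial))
    ]
    return fused, weights


# Toy features standing in for the spatial-stream and pose-stream outputs.
spatial_feat = [0.2, 0.9, 0.1]
pose_feat = [0.7, 0.1, 0.4]
query = [1.0, 0.0, 0.0]  # hypothetical query vector

fused, weights = attention_fuse(spatial_feat, pose_feat, query)
```

Here the pose stream scores higher against the query than the spatial stream, so it receives the larger weight and the fused vector lies closer to the pose features. Other fusion schemes, such as attention over spatial locations or body joints, are equally compatible with the description in the abstract.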
