Abstract

Skeleton data has become prevalent for activity recognition because of its advantages over RGB data. A skeleton video consists of frames that give the two- or three-dimensional coordinates of human body joints. Not all video frames are informative for recognizing an activity; a few key frames can represent it well. Moreover, not all joints participate in every activity, so the key joints may vary across frames and activities. In this paper, we propose a novel framework that finds temporal and spatial attention cooperatively for activity recognition. The proposed method, called STH-DRL, consists of a temporal agent and a spatial agent. The temporal agent finds the key frames (temporal hard attention), and the spatial agent finds the key joints (spatial hard attention). We formulate both search problems as Markov decision processes and train the agents through interaction with each other using deep reinforcement learning. Experimental results on three widely used activity recognition benchmark datasets demonstrate the effectiveness of the proposed method.
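To make the notion of hard attention concrete, the following is a minimal sketch of what selecting key frames and key joints does to a skeleton sequence. It is not the STH-DRL implementation: the function name `apply_hard_attention`, the array shapes, and the hand-picked frame indices and joint mask are all assumptions for illustration. In the method described above, these selections would come from the trained temporal and spatial agents.

```python
import numpy as np

def apply_hard_attention(skeleton, frame_idx, joint_mask):
    """Keep only selected frames and joints of a skeleton sequence.

    skeleton:   array of shape (T, J, C), T frames, J joints, C coordinates.
    frame_idx:  integer array of K key-frame indices (temporal hard attention).
    joint_mask: binary array of shape (K, J), 1 for key joints in each
                selected frame (spatial hard attention).
    All names and shapes here are hypothetical, chosen for illustration.
    """
    key_frames = skeleton[frame_idx]             # shape (K, J, C)
    return key_frames * joint_mask[:, :, None]   # zero out non-key joints

# Toy data: 30 frames, 25 joints, 3-D coordinates.
rng = np.random.default_rng(0)
skeleton = rng.normal(size=(30, 25, 3))

# Hand-picked selections standing in for the agents' decisions.
frame_idx = np.array([2, 10, 17, 28])
joint_mask = np.zeros((4, 25))
joint_mask[:, [4, 5, 6]] = 1.0                   # e.g. three arm joints

out = apply_hard_attention(skeleton, frame_idx, joint_mask)
```

Because the mask is indexed per selected frame, the set of key joints can differ from one key frame to the next, which matches the observation that key joints vary across frames and activities.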
