Abstract

Recent years have witnessed the popularity of using recurrent neural network (RNN) for action recognition in videos. However, videos are of high dimensionality and contain rich human dynamics with various motion scales, which makes the traditional RNNs difficult to capture complex action information. In this paper, we propose a novel recurrent spatial-temporal attention network (RSTAN) to address this challenge, where we introduce a spatial-temporal attention mechanism to adaptively identify key features from the global video context for every time-step prediction of RNN. More specifically, we make three main contributions from the following aspects. First, we reinforce the classical long short-term memory (LSTM) with a novel spatial-temporal attention module. At each time step, our module can automatically learn a spatial-temporal action representation from all sampled video frames, which is compact and highly relevant to the prediction at the current step. Second, we design an attention-driven appearance-motion fusion strategy to integrate appearance and motion LSTMs into a unified framework, where LSTMs with their spatial-temporal attention modules in two streams can be jointly trained in an end-to-end fashion. Third, we develop actor-attention regularization for RSTAN, which can guide our attention mechanism to focus on the important action regions around actors. We evaluate the proposed RSTAN on the benchmark UCF101, HMDB51 and JHMDB data sets. The experimental results show that, our RSTAN outperforms other recent RNN-based approaches on UCF101 and HMDB51 as well as achieves the state-of-the-art on JHMDB.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call