Abstract

Long short-term memory (LSTM) networks are widely used to handle temporal or sequential data and have great potential for video recognition. Existing LSTM-based video recognition methods either insert LSTM modules at the end of 2D convolutional neural networks (CNNs), known as global LSTM methods, or build networks solely by stacking multiple LSTM modules. Unfortunately, these LSTM-based methods are not competitive with state-of-the-art 3D CNNs or two-stream CNNs. To fully explore the potential of LSTM, this paper rethinks its role in video recognition architectures and proposes a novel Temporal Grafter Network (TGN). Specifically, we develop an efficient and effective variant of the convolutional LSTM module, which is grafted between different stages of very deep 2D CNNs for temporal modeling and delivery. Our TGN can capture local motion patterns of varying scales inherent in feature maps from high to low resolutions, while attending to spatial context information and modeling global temporal dependency across the whole video. The proposed TGN captures and transmits temporal information throughout very deep 2D CNNs, overcoming the downsides of existing LSTM-based methods and making full use of the potential of LSTM for effective video recognition and early action recognition. We perform extensive ablation studies to verify the effectiveness of the proposed methods, and experiments on three widely used video benchmarks show that our methods match or surpass the state of the art.
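
To make the grafting idea concrete, below is a minimal sketch of a convolutional LSTM cell inserted after one stage of a 2D CNN backbone. It assumes PyTorch; the class names (ConvLSTMCell, GraftedStage), the gating layout, and the stage interface are illustrative assumptions and not the authors' released implementation.

```python
# Illustrative sketch only: a ConvLSTM cell grafted after a 2D CNN stage.
# Assumptions: PyTorch; the backbone is split into stages that each map
# (B, C, H, W) frame features to (B, channels, H', W').
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """A basic convolutional LSTM cell operating on spatial feature maps."""

    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution produces all four gates (input, forget, cell, output).
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        i, f, g, o = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class GraftedStage(nn.Module):
    """Wraps one 2D CNN stage and grafts a ConvLSTM after it for temporal modeling."""

    def __init__(self, stage, channels):
        super().__init__()
        self.stage = stage                       # the original 2D convolutional stage
        self.temporal = ConvLSTMCell(channels, channels)

    def forward(self, clip):                     # clip: (T, B, C, H, W) frame features
        feats = [self.stage(frame) for frame in clip]   # per-frame spatial features
        h = torch.zeros_like(feats[0])
        c = torch.zeros_like(feats[0])
        outputs = []
        for spatial in feats:                    # recurrent pass over the time axis
            h, c = self.temporal(spatial, (h, c))
            outputs.append(h)                    # temporally enriched feature maps
        return torch.stack(outputs, dim=0)
```

In a full network of this kind, one such grafted module would sit between consecutive backbone stages, so temporal information is modeled and delivered from high-resolution to low-resolution feature maps rather than being aggregated only at the end of the 2D CNN.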
