Abstract
Modeling spatiotemporal representations is one of the most essential yet challenging issues in video action recognition. Existing methods lack the capacity to accurately model either the correlations between spatial and temporal features or the global temporal dependencies. Inspired by the two-stream network for video action recognition, we propose an encoder–decoder framework named Two-Stream Bidirectional Long Short-Term Memory (LSTM) Residual Network (TBRNet), which exploits the interaction between spatiotemporal representations and global temporal dependencies. In the encoding phase, the two-stream architecture, based on the proposed Residual Convolutional 3D (Res-C3D) network, extracts features with residual connections inserted between the two pathways, and the features are then fused to form the short-term spatiotemporal features of the encoder. In the decoding phase, these short-term spatiotemporal features are first fed into a temporal attention-based bidirectional LSTM (BiLSTM) network to obtain long-term bidirectional attention-pooled dependencies. These temporal dependencies are then integrated with the short-term spatiotemporal features to obtain global spatiotemporal relationships. On two benchmark datasets, UCF101 and HMDB51, we verified the effectiveness of the proposed TBRNet through a series of experiments, achieving results competitive with, or better than, existing state-of-the-art approaches.
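The temporal attention pooling over BiLSTM outputs described above can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the scoring vector `w`, the per-timestep features `h`, and the function names are assumptions, and in practice `h` would come from a trained BiLSTM and `w` would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_pool(h, w):
    """h: (T, D) per-timestep BiLSTM outputs; w: (D,) learned scoring vector.
    Returns the attention-pooled feature (D,) and the weights (T,)."""
    scores = h @ w           # (T,) relevance score for each timestep
    alpha = softmax(scores)  # attention weights, non-negative and summing to 1
    pooled = alpha @ h       # weighted sum of timestep features over time
    return pooled, alpha

# Toy example with random stand-ins for BiLSTM outputs.
T, D = 8, 16
rng = np.random.default_rng(0)
h = rng.normal(size=(T, D))
w = rng.normal(size=D)
pooled, alpha = temporal_attention_pool(h, w)
```

The pooled vector is a convex combination of the per-timestep features, so frames with higher attention scores contribute more to the long-term representation.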
Highlights
With the rapid development of the mobile Internet and the continuous updating of video capture devices, the number of video resources is growing explosively.
In the encoding phase, in contrast to the original two-stream network that extracts appearance and motion features separately, our proposed two-stream encoder consists of a spatial appearance stream and a temporal motion stream with multiplicative residual connections inserted between the two pathways.
Long-term recurrent convolutional network (LRCN) [6] is an end-to-end framework that classifies the action in video sequences using Long Short-Term Memory (LSTM) on features extracted by convolutional neural networks (CNNs).
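The multiplicative cross-stream residual connection highlighted above can be sketched as follows. This is a minimal NumPy illustration under assumptions: the exact placement and form of the interaction in TBRNet are not specified here, and the function name and elementwise gating are illustrative only.

```python
import numpy as np

def cross_stream_residual(x_app, x_mot):
    """Hypothetical multiplicative cross-stream residual connection:
    motion features gate the appearance features elementwise, and the
    modulated signal is added back onto the identity path, so gradients
    can also flow unchanged through the skip connection."""
    return x_app + x_app * x_mot  # identity + multiplicative interaction

# Toy feature maps standing in for the two streams' activations (N, C).
rng = np.random.default_rng(1)
x_app = rng.normal(size=(4, 32))  # spatial appearance stream
x_mot = rng.normal(size=(4, 32))  # temporal motion stream
fused = cross_stream_residual(x_app, x_mot)
```

Because the identity term is preserved, the connection injects motion information into the appearance pathway without blocking backpropagation through the original appearance features.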
Summary
With the rapid development of the mobile Internet and the continuous updating of video capture devices, the number of video resources is growing explosively. In the encoding phase, in contrast to the original two-stream network that extracts appearance and motion features separately, our proposed two-stream encoder consists of a spatial appearance stream and a temporal motion stream with multiplicative residual connections inserted between the two pathways. We accurately modeled the interactions between spatial and temporal features with a two-stream encoder whose cross-stream residual connections benefit the backpropagation of gradients, and we effectively captured global spatiotemporal dependencies by integrating the local short-term spatiotemporal features with the long-term bidirectional temporal dependencies. The rest of the paper is organized as follows: Section 2 briefly reviews related work.