Abstract
Action recognition is an important yet challenging task in computer vision. The attention mechanism tells not only where to focus but also when to focus, and it plays a key role in extracting the discriminative spatial and temporal features needed to solve the task. In this paper, we propose an improved spatiotemporal attention model, built on the two-stream structure, to recognize different actions in videos. Specifically, we first extract intra-frame spatial features and inter-frame optical-flow features for each video. We then apply an effective attention module that sequentially infers attention maps along three separate dimensions: channel, spatial, and temporal. After adaptively refining the features with these attention maps, we perform a temporal pooling step to squeeze the temporal dimension. The resulting spatial and temporal features are fed into a spatial LSTM and a temporal LSTM, respectively. Finally, we fuse the spatial feature, the temporal feature, and the two-stream fusion feature to classify the actions in videos. Additionally, we collect and construct a new Ping-Pong action dataset from YouTube for a subsequent human-robot interaction task; it contains 2400 labeled videos across 4 categories. We compare the proposed method with other action recognition algorithms and validate its feasibility and effectiveness on the Ping-Pong action dataset and the HMDB51 dataset.
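To make the sequential channel-spatial-temporal attention concrete, the following is a minimal PyTorch sketch of such a module. The paper calls its module iCBAM; the layer sizes here (reduction ratios, the 7x7 spatial kernel) are illustrative assumptions, not the authors' published configuration.

```python
# Sketch of a CBAM-style attention module extended with a temporal branch.
# Input is a clip tensor of shape (B, T, C, H, W); features are refined
# sequentially along the channel, spatial, and temporal dimensions.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                      # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))     # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))      # max-pooled descriptor
        return torch.sigmoid(avg + mx)[..., None, None]  # (N, C, 1, 1)


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                      # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)      # (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)       # (N, 1, H, W)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))


class TemporalAttention(nn.Module):
    def __init__(self, num_frames, reduction=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_frames, num_frames // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(num_frames // reduction, num_frames),
        )

    def forward(self, x):                      # x: (B, T, C, H, W)
        desc = x.mean(dim=(2, 3, 4))           # one scalar per frame: (B, T)
        return torch.sigmoid(self.mlp(desc))[:, :, None, None, None]


class ICBAM(nn.Module):
    """Sequentially refine features along channel, spatial, temporal axes."""

    def __init__(self, channels, num_frames):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()
        self.ta = TemporalAttention(num_frames)

    def forward(self, x):                      # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        y = x.reshape(b * t, c, h, w)
        y = y * self.ca(y)                     # channel refinement
        y = y * self.sa(y)                     # spatial refinement
        y = y.reshape(b, t, c, h, w)
        return y * self.ta(y)                  # temporal refinement
```

The same module can be shared by both streams, since RGB features and optical-flow features enter it with the same tensor layout.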
Highlights
Video action recognition aims to predict action type labels from videos. It has drawn increasing attention given its potential applications in many fields, e.g., assisted living, human-robot interaction, or intelligent video surveillance.
The main contributions of this work are summarized as follows: 1) We propose a deep learning framework for video action recognition that can explicitly capture spatiotemporal information based on a three-dimensional attention module.
Model Architecture
This paper proposes an iCBAM-based spatiotemporal-stream model for action recognition in videos.
Summary
Video action recognition aims to predict action type labels from videos, and it has drawn increasing attention given its potential applications in many fields, e.g., assisted living, human-robot interaction, or intelligent video surveillance. Zang et al. [8] suggest utilizing an attention-based temporal weighted CNN to learn action features. All these methods attempt to find an effective model that can identify the distinguishing spatiotemporal information in video data.

Motivated by the above, we propose a three-dimensional attention-based spatiotemporal-stream model for the action recognition task. The aim of this model is to selectively consider information over the spatial, channel, and temporal dimensions simultaneously. The main contributions of this work are summarized as follows: 1) We propose a deep learning framework for video action recognition that can explicitly capture spatiotemporal information based on a three-dimensional attention module.

We introduce each module in turn, following the proposed network architecture (Fig. 1): the Network Inputs, the Improved Attention Mechanism (iCBAM), the Temporal Segment, and the Temporal and Spatial LSTMs, and then discuss the experimental details and results of the proposed method for action recognition. A sketch of the downstream pipeline follows.
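The sketch below, in the same hedged spirit, covers the stages after attention refinement: temporal pooling over segments, one LSTM per stream, and fusion of the spatial, temporal, and combined two-stream features for classification. The hidden size, segment count, and equal-weight averaging of the three logits are assumptions for illustration, not the paper's reported settings.

```python
# Downstream pipeline sketch: temporal pooling -> per-stream LSTM -> fusion.
# Each stream receives attention-refined frame features of shape (B, T, D).
import torch
import torch.nn as nn


class StreamHead(nn.Module):
    """One stream: segment-wise temporal pooling, then an LSTM encoder."""

    def __init__(self, feat_dim, hidden_dim, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def encode(self, feats):                   # feats: (B, T, D)
        b, t, d = feats.shape
        # Temporal pooling: average the frames inside each segment,
        # squeezing T down to num_segments (assumes T % num_segments == 0).
        segs = feats.reshape(b, self.num_segments, t // self.num_segments, d)
        out, _ = self.lstm(segs.mean(dim=2))   # (B, num_segments, H)
        return out[:, -1]                      # last-step hidden state


class TwoStreamFusion(nn.Module):
    def __init__(self, feat_dim, hidden_dim, num_classes):
        super().__init__()
        self.spatial_head = StreamHead(feat_dim, hidden_dim, num_classes)
        self.temporal_head = StreamHead(feat_dim, hidden_dim, num_classes)
        self.fusion_fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, rgb_feats, flow_feats):
        s_feat = self.spatial_head.encode(rgb_feats)    # spatial stream
        t_feat = self.temporal_head.encode(flow_feats)  # temporal stream
        s_logit = self.spatial_head.fc(s_feat)
        t_logit = self.temporal_head.fc(t_feat)
        # Two-stream fusion feature: concatenated LSTM states.
        f_logit = self.fusion_fc(torch.cat([s_feat, t_feat], dim=1))
        # Fuse spatial, temporal, and fused predictions (equal weights here).
        return (s_logit + t_logit + f_logit) / 3
```

Concatenating the two LSTM states before the fusion classifier is one common way to realize the "two-stream fusion feature"; a weighted sum of the stream logits would be an equally plausible reading of the paper's description.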