Abstract

Human action recognition in videos is an important research topic in computer vision owing to its wide range of applications. Actions naturally contain both spatial and temporal information, so the key to action recognition is to model the spatial and temporal structures of actions. In this study, the authors propose an attention‐based spatial–temporal hierarchical convolutional long short‐term memory (ST‐HConvLSTM) network to model these structures in the spatial and temporal domains. The ST‐HConvLSTM consists of two parts: a spatial–temporal attention module and a novel LSTM‐like architecture named hierarchical ConvLSTM (HConvLSTM). The HConvLSTM models the spatial and temporal structures of actions, while the spatial–temporal attention module identifies which parts of a video are most discriminative for action recognition and directs the HConvLSTM's focus toward them. In addition, a weighted fusion strategy is proposed to fuse the appearance information and motion information of the video. The proposed ST‐HConvLSTM is evaluated on the UCF101, HMDB51 and Kinetics datasets. Experimental results show that the proposed ST‐HConvLSTM achieves state‐of‐the‐art performance compared with other recent LSTM‐like architectures and attention‐based methods.
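The abstract mentions a weighted fusion of the appearance (RGB) and motion (optical-flow) streams but does not specify the scheme. A common approach is weighted late fusion of per-stream class probabilities; the sketch below illustrates that generic idea only. The function names, the example scores, and the scalar weight `alpha` are all assumptions for illustration, not the paper's actual method.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of class scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_fusion(rgb_scores, flow_scores, alpha=0.4):
    # Hypothetical late fusion: blend the two streams' class
    # probabilities with an assumed scalar weight `alpha` on the
    # appearance (RGB) stream; (1 - alpha) goes to the motion stream.
    p_rgb = softmax(rgb_scores)
    p_flow = softmax(flow_scores)
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(p_rgb, p_flow)]

rgb = [2.0, 0.5, 0.1]   # illustrative appearance-stream class scores
flow = [0.3, 1.8, 0.2]  # illustrative motion-stream class scores
fused = weighted_fusion(rgb, flow)
predicted_class = max(range(len(fused)), key=lambda i: fused[i])
```

Here the motion stream dominates the fused prediction because it receives the larger weight; in practice the stream weights are typically chosen on a validation set.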
