Abstract

Action recognition is a challenging video understanding task for two reasons: (i) complex video backgrounds impair the recognition of the actions of interest, and (ii) spatial and temporal information must be fused effectively. In this paper, we propose a novel spatial and temporal saliency based four-stream network with multi-task learning. The proposed model comprises four streams: an appearance stream (i.e., a spatial stream), a motion stream (i.e., a temporal stream), a novel spatial saliency stream and a novel temporal saliency stream. The spatial stream captures global spatial information using sampled RGB video frames as input, and the temporal stream captures the global motion information of each pixel using sampled stacked optical flow frames as input. The two novel streams acquire saliency information from spatial saliency frames and temporal saliency frames, respectively. On top of the four streams, a multi-task learning based LSTM is adopted, which shares complementary knowledge among the CNN features extracted from the different stacked frames. This LSTM captures long-term dependencies between consecutive frames over the temporal evolution of a video, thereby taking full advantage of both CNNs and LSTMs. We conduct experiments on three popular action recognition datasets, namely UCF101, HMDB51 and the large-scale Kinetics dataset, to verify the effectiveness of the proposed network; the results demonstrate that it outperforms state-of-the-art methods on these datasets.
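
To make the architecture concrete, the following is a minimal PyTorch sketch of the four-stream design described above. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: the ResNet-18 backbones, the feature and hidden dimensions, the number of stacked flow channels, the shared LSTM as the multi-task coupling, and the score-averaging fusion are all choices made here for illustration.

# Minimal sketch of a four-stream network with an LSTM fusion stage.
# Illustrative reconstruction only; backbone choice, dimensions and
# fusion scheme are assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torchvision.models as models

class FourStreamLSTM(nn.Module):
    def __init__(self, num_classes=101, feat_dim=512, hidden_dim=512):
        super().__init__()
        # One CNN per stream: RGB, stacked optical flow,
        # spatial saliency frames, and temporal saliency frames.
        def backbone(in_ch):
            net = models.resnet18(weights=None)
            # Adapt the first convolution to the stream's channel count.
            net.conv1 = nn.Conv2d(in_ch, 64, 7, stride=2, padding=3, bias=False)
            net.fc = nn.Identity()  # expose 512-d features instead of logits
            return net
        self.rgb = backbone(3)
        self.flow = backbone(10)   # assumed: 5 stacked flow fields (x, y)
        self.sal_s = backbone(3)   # spatial saliency stream
        self.sal_t = backbone(3)   # temporal saliency stream
        # A single LSTM shared across streams models long-term dependencies
        # over the frame features; the sharing is the multi-task coupling
        # assumed in this sketch.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # One classification head per stream (one "task" per stream).
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_classes) for _ in range(4)])

    def forward(self, rgb, flow, sal_s, sal_t):
        # Each input tensor: (batch, time, channels, height, width).
        logits = []
        for stream, x, head in zip(
                (self.rgb, self.flow, self.sal_s, self.sal_t),
                (rgb, flow, sal_s, sal_t), self.heads):
            b, t = x.shape[:2]
            # Run the CNN on every frame, then restore the time axis.
            feats = stream(x.flatten(0, 1)).view(b, t, -1)
            out, _ = self.lstm(feats)          # shared across streams
            logits.append(head(out[:, -1]))    # last-step representation
        # Late fusion: average the per-stream class scores.
        return torch.stack(logits).mean(0)

In this sketch, sharing a single LSTM across the four streams is what realizes the knowledge sharing between the streams' CNN features, while the per-stream classification heads play the role of the individual tasks; averaging their scores is one simple late-fusion choice.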
