In this paper, we propose Motion Saliency based multi-stream Multiplier ResNets (MSM-ResNets) for action recognition. The proposed MSM-ResNets model consists of three interacting streams: the appearance stream, the motion stream, and the motion saliency stream. As in conventional two-stream CNN models, the appearance stream and the motion stream capture appearance information and motion information, respectively, while the motion saliency stream captures salient motion information. In particular, to exploit the spatiotemporal interactions between streams, the proposed MSM-ResNets model establishes connections between the streams rather than fusing the three streams only at the final output layer. Two kinds of multiplicative connections are injected: the first runs from the motion stream to the appearance stream, and the second runs from the motion saliency stream to the motion stream. Experimental results on two standard action recognition datasets, UCF101 and HMDB51, verify the effectiveness of the proposed MSM-ResNets.
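The two multiplicative connections can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say where in a residual unit the multiplication is applied, so the sketch assumes the gating features are multiplied element-wise into the input of the residual branch, and it uses a plain ReLU as a stand-in for the learned convolutional branch. The names `hadamard`, `residual_unit`, and `msm_step` are hypothetical, and feature maps are reduced to flat lists of floats.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def relu(v: Vector) -> Vector:
    # Stand-in for the learned residual branch (assumption, not from the paper).
    return [max(0.0, x) for x in v]


def hadamard(a: Vector, b: Vector) -> Vector:
    # Element-wise product of two feature vectors of equal length.
    if len(a) != len(b):
        raise ValueError("feature shapes must match")
    return [x * y for x, y in zip(a, b)]


def residual_unit(x: Vector,
                  f: Callable[[Vector], Vector],
                  gate: Optional[Vector] = None) -> Vector:
    # Plain residual unit: x + f(x).
    # With a gate (assumed placement): x + f(x * gate).
    branch_in = x if gate is None else hadamard(x, gate)
    return [xi + ri for xi, ri in zip(x, f(branch_in))]


def msm_step(appearance: Vector,
             motion: Vector,
             saliency: Vector,
             f: Callable[[Vector], Vector] = relu
             ) -> Tuple[Vector, Vector, Vector]:
    # Connection 1: motion stream -> appearance stream.
    new_appearance = residual_unit(appearance, f, gate=motion)
    # Connection 2: motion saliency stream -> motion stream.
    new_motion = residual_unit(motion, f, gate=saliency)
    # The saliency stream receives no incoming connection in this sketch.
    new_saliency = residual_unit(saliency, f)
    return new_appearance, new_motion, new_saliency


if __name__ == "__main__":
    print(msm_step([1.0, 2.0], [0.5, -1.0], [2.0, 0.0]))
```

Because the gating happens inside each unit rather than at the output, a gradient flowing back through the appearance stream also reaches the motion stream, which is the kind of cross-stream interaction that late fusion alone does not provide.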