Abstract

Temporal modeling remains a challenge for action recognition. Most existing temporal models focus on learning local variations between neighboring frames. However, there exist obvious deviations between local and global variations, such as subtle versus notable motion changes. In this paper, we propose a global temporal difference module for action recognition, which consists of two sub-modules, i.e., a global aggregation module and a global difference module. These two sub-modules cooperate following the idea of using prior knowledge from the global view (i.e., global motion variation) to guide local learning at each moment. In the global aggregation module, the global prior knowledge is learned by aggregating the visual feature sequence of a video into a global vector. In the global difference module, we build the difference vector sequence of the video by subtracting each local vector from the global vector. Our method thus acts as contextual guidance with a global view. The sequential dependency between these difference vectors is exploited with a channel-wise self-attention operation. Finally, the difference vectors at each timestamp are further used to enhance the semantics of the original local features. The enhanced features allow the action recognizer to understand the variation in the video globally with less deviation. We instantiate the global temporal difference module into the ResNet block to form a global temporal difference network (GTDNet). Exhaustive experiments are conducted, and our method achieves competitive performance at small FLOPs on Something-Something V1 & V2 and Kinetics-400.
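The pipeline described in the abstract (aggregate the feature sequence into a global vector, subtract each local vector from it, relate the resulting difference vectors with self-attention, then add them back to enhance the local features) can be sketched as follows. This is a minimal illustrative reading, not the paper's implementation: mean pooling for aggregation, a single dot-product attention over the difference sequence, and a residual enhancement are all assumptions, and the function name `global_temporal_difference` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_temporal_difference(features):
    """Sketch of the module on a (T, C) per-video feature sequence.

    T = number of timestamps (frames), C = channel dimension.
    """
    # Global aggregation module: pool the sequence into one global
    # vector (assumption: simple temporal mean pooling).
    global_vec = features.mean(axis=0)                      # (C,)

    # Global difference module: subtract each local vector from the
    # global vector to form the difference sequence.
    diffs = global_vec[None, :] - features                  # (T, C)

    # Exploit sequential dependency among difference vectors with a
    # self-attention operation (simplified single-head sketch; the
    # paper's channel-wise variant may differ).
    scores = diffs @ diffs.T / np.sqrt(diffs.shape[1])      # (T, T)
    context = softmax(scores, axis=-1) @ diffs              # (T, C)

    # Use the attended differences to enhance the semantics of the
    # original local features (assumption: residual addition).
    return features + context

feats = np.random.default_rng(0).normal(size=(8, 16))       # 8 frames, 16 channels
out = global_temporal_difference(feats)
```

The enhanced output keeps the shape of the input sequence, so a module like this could be dropped into a ResNet block, as GTDNet does, without changing the surrounding architecture.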
