Abstract

Standard approaches to video action recognition usually operate on full input videos, which is inefficient given the widespread spatio-temporal redundancy in videos. Recent progress in masked video modelling, specifically VideoMAE, has shown that vanilla Vision Transformers (ViT) can complete spatio-temporal contexts from limited visual content. Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating on only a portion of each video. MAR has two essential components: cell running masking and a bridging classifier. Specifically, to enable the ViT to perceive details beyond the visible patches, cell running masking preserves the spatio-temporal correlations in videos, ensuring that patches at the same spatial location are observed in turn, which makes reconstruction easy. Additionally, we observe that although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, we propose a bridging classifier that fills the semantic gap between the ViT-encoded features used for reconstruction and the specialized features needed for classification. MAR reduces the computational cost of ViT by 53%. Extensive experiments demonstrate that MAR consistently outperforms existing ViT models by a notable margin. Notably, a ViT-Large model fine-tuned with MAR achieves performance comparable to a ViT-Huge model fine-tuned with standard training on both the Kinetics-400 and Something-Something v2 datasets, while the computational overhead of the ViT-Large model is only 14.5% of that of the ViT-Huge model. Code is available at https://github.com/alibaba-mmai-research/Masked-Action-Recognition.
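To make the two components concrete, here is a minimal, hypothetical sketch of cell running masking as described above: each frame's patch grid is tiled into small cells, and the visible positions within every cell cycle ("run") across frames so that each spatial location is observed in turn. The function name `cell_running_mask` and its parameters are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import torch

def cell_running_mask(num_frames, h, w, cell=2, visible_per_cell=2):
    """Boolean mask of shape (num_frames, h, w); True = masked patch.

    Each frame's h x w patch grid is tiled into cell x cell groups. At
    frame t, a window of `visible_per_cell` positions inside every cell
    is left visible, and the window shifts by one position per frame,
    so all positions are observed in turn across the clip.
    """
    assert h % cell == 0 and w % cell == 0
    positions = cell * cell
    mask = torch.ones(num_frames, h, w, dtype=torch.bool)
    for t in range(num_frames):
        for k in range(visible_per_cell):
            p = (t + k) % positions              # position visible at frame t
            dy, dx = divmod(p, cell)
            mask[t, dy::cell, dx::cell] = False  # unmask it in every cell
    return mask

# 16 frames on a 14x14 patch grid, 2x2 cells, 2 of 4 positions visible.
mask = cell_running_mask(16, 14, 14)
print(mask.float().mean().item())  # 0.5, i.e. a 50% masking ratio
```

The bridging classifier can likewise be pictured as a small stack of standard transformer blocks that maps the encoder's reconstruction-oriented features to classification-ready ones, followed by pooling and a linear head. Again, this is an assumed sketch built from standard PyTorch modules, not the paper's exact architecture:

```python
import torch.nn as nn

class BridgingClassifier(nn.Module):
    """A few transformer blocks on top of the visible-token features,
    followed by mean pooling and a linear classification head."""

    def __init__(self, dim=1024, depth=2, heads=16, num_classes=400):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, visible_tokens):       # (B, N_visible, dim)
        x = self.blocks(visible_tokens)      # bridge the semantic gap
        x = self.norm(x.mean(dim=1))         # pool over visible tokens
        return self.head(x)                  # class logits
```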
