SSAN: Separable Self-Attention Network for Video Representation Learning

Xudong Guo, Xun Guo, Yan Lu

doi:10.1109/cvpr46437.2021.01243

Abstract

Self-attention has been successfully applied to video representation learning because it models long-range dependencies effectively. Existing approaches build these dependencies merely by computing pairwise correlations along the spatial and temporal dimensions simultaneously. However, spatial correlations and temporal correlations represent different things: the contextual information of scenes and temporal reasoning, respectively. Intuitively, learning spatial contextual information first will benefit temporal modeling. In this paper, we propose a separable self-attention (SSA) module, which models spatial and temporal correlations sequentially, so that spatial contexts can be efficiently used in temporal modeling. By adding the SSA module to a 2D CNN, we build an SSA network (SSAN) for video representation learning. On the task of video action recognition, our approach outperforms state-of-the-art methods on the Something-Something and Kinetics-400 datasets. Our models often outperform counterparts while using a shallower network and fewer modalities. We further verify the semantic learning ability of our method on the visual-language task of video retrieval, which showcases the homogeneity of video representations and text embeddings. On the MSR-VTT and Youcook2 datasets, video representations learnt by SSA significantly improve on state-of-the-art performance.
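The idea of modeling spatial correlations first and temporal correlations second can be illustrated with a minimal sketch. This is not the authors' SSA module: the abstract gives no implementation details, so the function names, the identity query/key/value projections, and the absence of residual connections and learned weights are all simplifying assumptions made here to keep the example dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of feature vectors.

    Assumption: identity query/key/value projections. A real module
    would use learned projections instead.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[c] for w, value in zip(weights, tokens)) for c in range(dim)]
        )
    return attended


def separable_self_attention(video):
    """Hypothetical helper: video[t][n] is the feature vector at
    spatial position n of frame t."""
    num_frames, num_positions = len(video), len(video[0])
    # Step 1: spatial attention, applied independently within each frame.
    spatial = [self_attention(frame) for frame in video]
    # Step 2: temporal attention across frames at each spatial position,
    # computed on the spatially attended features from step 1.
    result = [[None] * num_positions for _ in range(num_frames)]
    for n in range(num_positions):
        track = self_attention([spatial[t][n] for t in range(num_frames)])
        for t in range(num_frames):
            result[t][n] = track[t]
    return result


# Toy input: 2 frames, 3 spatial positions, 2 channels.
toy_video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]],
]
toy_output = separable_self_attention(toy_video)
```

The output keeps the input's frames × positions × channels layout. In this sketch, a clip with T frames and N positions per frame needs T·N² + N·T² pairwise comparisons, whereas attending over all T·N tokens jointly would need (T·N)².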
