Gate-Shift-Fuse for Video Action Recognition.

Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

doi:10.1109/tpami.2023.3268134

Abstract

Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs, but existing approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose the input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
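The gate, shift, and fuse steps named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the two-group channel split, the tanh gate computed from a simple channel projection, the one-frame shift, and the pooled sigmoid fusion weight are all assumptions chosen only to make the data flow concrete.

```python
import numpy as np


def temporal_shift(x, step):
    """Shift x of shape (T, C, H, W) by `step` frames along time, zero-filling the gap."""
    out = np.zeros_like(x)
    if step > 0:
        out[step:] = x[:-step]
    elif step < 0:
        out[:step] = x[-step:]
    else:
        out[:] = x
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gate_shift_fuse_sketch(x, gate_weights):
    """Illustrative gate-shift-fuse pass (hypothetical, simplified).

    x:            features of shape (T, C, H, W), with C even.
    gate_weights: hypothetical projection weights of shape (2, C // 2),
                  standing in for a learned gating layer.
    """
    half = x.shape[1] // 2
    groups = (x[:, :half], x[:, half:])
    steps = (1, -1)  # assumption: group 0 shifts forward in time, group 1 backward
    outputs = []
    for g, (xg, step) in enumerate(zip(groups, steps)):
        # Gate: one spatial gating plane per frame and group, values in (-1, 1).
        gate = np.tanh(np.einsum("tchw,c->thw", xg, gate_weights[g]))[:, None]
        gated = gate * xg        # part routed through time
        residual = xg - gated    # part kept at the current frame
        shifted = temporal_shift(gated, step)
        # Fuse: per-channel weight from globally pooled features (assumed form).
        pooled = (shifted + residual).mean(axis=(0, 2, 3))
        w = sigmoid(pooled)[None, :, None, None]
        outputs.append(w * shifted + (1.0 - w) * residual)
    return np.concatenate(outputs, axis=1)
```

In this sketch the output has the same shape as the input, which reflects the abstract's point that such a module can be inserted into an existing 2D CNN without changing the surrounding layers. In a trained network the gating and fusion weights would be learned rather than passed in by hand.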

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date: Sep 1, 2023
Citations: 9

Similar Papers

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA
Junzhong Shen ... Mei Wen
15 Feb 2018

F-E3D: FPGA-based Acceleration of an Efficient 3D Convolutional Neural Network for Human Action Recognition
Hongxiang Fan ... Shuanglong Liu
01 Jul 2019

Dissected 3D CNNs: Temporal skip connections for efficient online video processing
Okan Köpüklü ... Gerhard Rigoll
Computer Vision and Image Understanding | VOL. 215
01 Dec 2021

3D-D2D: An Efficient Hybrid Model For Hyperspectral Image Classification
Sohan K R ... Satish Kumar Singh
25 Aug 2022
