OPT-STVIT: Video Recognition through Optimized Spatial-Temporal Video Vision Transformers

Dr Divya Nimma, Arjun Uddagiri

doi:10.70135/seejph.vi.2341

Abstract

In this paper, we address the computational challenges associated with video recognition tasks, where video transformers have shown impressive results but come with high computational costs. We introduce Opt-STViT, a token selection framework that dynamically chooses a subset of informative tokens in both the temporal and spatial dimensions based on the input video samples. Specifically, we frame token selection as a ranking problem, leveraging a lightweight scorer network to estimate the importance of each token. Only the tokens with the top scores are retained for downstream processing. In the temporal dimension, we identify and keep the frames most relevant to the action categories, while in the spatial dimension, we pinpoint the most discriminative regions in feature maps without affecting the spatial context used hierarchically in most video transformers. To enable end-to-end training despite the non-differentiable nature of token selection, we employ a perturbed-maximum-based differentiable Top-K operator. Our extensive experiments, primarily conducted on the Kinetics-400 and Something-Something-V2 datasets using the recently introduced MViT video transformer backbone, demonstrate that our framework achieves similar results while requiring 20 percent fewer computational resources. We also establish the versatility of our approach across different transformer architectures and video datasets.
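
The score-then-keep-Top-K idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Gaussian noise model, and the values of `num_samples` and `sigma` are assumptions made for illustration, and the scores are taken as given rather than produced by a trained scorer network. The sketch shows only the forward pass of a perturbed-maximum relaxation, in which hard Top-K indicator matrices are averaged over noisy copies of the scores.

```python
import numpy as np


def hard_top_k(scores, k):
    """Return the indices of the k highest scores, sorted by position.

    Sorting by position keeps the selected tokens in their original
    temporal or spatial order.
    """
    idx = np.argpartition(-scores, k - 1)[:k]
    return np.sort(idx)


def perturbed_top_k(scores, k, num_samples=500, sigma=0.05, seed=0):
    """Approximate a soft Top-K indicator matrix of shape (k, n).

    Hypothetical helper: Gaussian noise is added to the scores, a hard
    Top-K is taken for every noisy copy, and the resulting one-hot
    indicator matrices are averaged. Each row sums to 1.
    """
    rng = np.random.default_rng(seed)
    n = scores.shape[0]
    noise = rng.standard_normal((num_samples, n)) * sigma
    soft = np.zeros((k, n))
    for z in noise:
        idx = hard_top_k(scores + z, k)
        soft[np.arange(k), idx] += 1.0
    return soft / num_samples


def select_tokens(tokens, scores, k):
    """Keep only the k tokens with the highest scores (inference-style)."""
    idx = hard_top_k(scores, k)
    return tokens[idx], idx


# Five tokens with 4-dimensional features; scores are assumed to come
# from some lightweight scorer.
tokens = np.arange(20, dtype=float).reshape(5, 4)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])

kept, kept_idx = select_tokens(tokens, scores, k=2)
soft_indicator = perturbed_top_k(scores, k=2)
soft_tokens = soft_indicator @ tokens  # training-style soft selection
```

In this sketch the hard selection would be used at inference time, while the averaged indicator matrix stands in for the relaxed selection that makes training end to end possible; the gradient computation itself is omitted.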

Journal: South Eastern European Journal of Public Health
Publication Date: Nov 21, 2024
License type: CC BY-ND 4.0
