M-adapter: Multi-level image-to-video adaptation for video action recognition

Rongchang Li, Tianyang Xu, Xiao-Jun Wu, Linze Li, Xiao Yang, Zhongwei Shen, Josef Kittler

doi:10.1016/j.cviu.2024.104150

Abstract

As visual foundation models grow in size, training video models from scratch has become costly and difficult. Recent work transfers frozen pre-trained image models (PIMs) to video tasks by tuning inserted learnable parameters such as adapters and prompts. However, these methods must still store PIM activations to compute gradients, which limits the GPU memory they save. In this paper, we propose a novel parallel branch that adapts the multi-level outputs of the frozen PIM for action recognition. Because it avoids passing gradients through the PIM, it has a much lower GPU memory footprint. The proposed adaptation branch consists of hierarchically combined multi-level output adapters (M-adapters), which comprise a fusion module and a temporal module. This design bridges the discrepancy between the pre-training task and the target task at a lower training cost. We show that with larger models, or in scenarios that place higher demands on temporal modelling, the proposed method outperforms full-parameter tuning. Finally, despite tuning fewer parameters, our method achieves performance superior or comparable to current state-of-the-art methods.
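The idea of training a parallel branch on the frozen model's multi-level outputs, without passing gradients through the frozen model, can be illustrated with a deliberately tiny sketch. This is not the authors' architecture: the scalar "backbone", the names `FROZEN_WEIGHTS`, `frozen_multilevel_features`, `branch_forward`, and `train_branch`, the linear fusion, and the mean-over-frames temporal step are all assumptions chosen for illustration. It only shows that the branch's gradient can be computed from the backbone outputs alone.

```python
import math

# Frozen "backbone": fixed per-level weights that are never updated.
# These values are hypothetical and stand in for a pre-trained image model.
FROZEN_WEIGHTS = [0.9, -1.3, 0.7]


def frozen_multilevel_features(frame):
    """Run one frame (a scalar here) through the frozen backbone.

    Returns the output of every level, not just the last one.
    No derivative information is kept for the backbone.
    """
    outputs = []
    h = frame
    for w in FROZEN_WEIGHTS:
        h = math.tanh(w * h)
        outputs.append(h)
    return outputs


def branch_forward(video, fusion_weights):
    """Toy side branch: average each level's output over frames, then fuse levels.

    The fusion is linear, so averaging over time before fusing gives the
    same result as fusing per frame and then averaging.
    """
    per_level_means = [0.0] * len(FROZEN_WEIGHTS)
    for frame in video:
        for level, h in enumerate(frozen_multilevel_features(frame)):
            per_level_means[level] += h / len(video)
    prediction = sum(a * m for a, m in zip(fusion_weights, per_level_means))
    return prediction, per_level_means


def train_branch(video, target, steps=200, lr=0.5):
    """Fit only the branch's fusion weights with plain gradient descent."""
    fusion_weights = [0.0] * len(FROZEN_WEIGHTS)
    losses = []
    for _ in range(steps):
        prediction, feats = branch_forward(video, fusion_weights)
        error = prediction - target
        losses.append(0.5 * error ** 2)
        # The gradient for each branch weight is error * feature value.
        # It needs only the backbone outputs, never a backward pass
        # through the tanh layers of the backbone.
        fusion_weights = [a - lr * error * m
                          for a, m in zip(fusion_weights, feats)]
    return fusion_weights, losses


if __name__ == "__main__":
    toy_video = [0.2, 0.5, -0.1, 0.8]  # four scalar "frames" (made-up data)
    weights, losses = train_branch(toy_video, target=1.0)
    print("first loss:", losses[0])
    print("last loss:", losses[-1])
```

In this toy setting the loss falls steadily while `FROZEN_WEIGHTS` stays untouched. In a real framework the same separation would typically be obtained by computing the backbone features with gradient tracking disabled, so that its intermediate activations need not be retained.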


Published in: Computer Vision and Image Understanding

