Abstract

Large-scale pre-trained transformers have recently achieved remarkable success in several computer vision tasks. However, fully fine-tuning these models for downstream tasks remains highly challenging due to their expensive computation and storage costs. Recently, Parameter-Efficient Tuning (PETuning) techniques, e.g., Visual Prompt Tuning (VPT), have significantly reduced the computation cost by inserting lightweight prompt modules, such as prompt tokens or adapter layers, into the pre-trained models and tuning only these modules with a small number of trainable parameters, while keeping the transformer backbone frozen. Although encouraging results have been achieved, existing PETuning methods perform poorly under few-shot learning settings (i.e., extremely limited training data, with only 1 or 2 shots per class) due to the scarce supervision signal. To this end, we first empirically identify that the poor performance is mainly caused by the inappropriate way of initializing prompt modules, a finding that has also been verified in pre-trained language models. We then propose a Visual Pre-trained Prompt Tuning (VPPT) framework, which first pre-trains the prompt modules and then leverages the pre-trained modules, together with the pre-trained transformer backbone, to perform prompt tuning on downstream tasks. Extensive experiments show that our VPPT framework achieves an absolute improvement of 16.08% in average accuracy under the 1-shot setting on five fine-grained visual classification datasets, compared with previous PETuning techniques such as VPT, in few-shot image classification.
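To make the prompt-tuning mechanism described above concrete, the following is a minimal sketch of VPT-style tuning with a frozen backbone. It is not the paper's implementation: the class name `PromptedViT`, the parameter `num_prompts`, and the use of a plain `nn.TransformerEncoder` as a stand-in for a pre-trained ViT are all illustrative assumptions. The random prompt initialization shown here is exactly what VPPT would replace with pre-trained prompt modules.

```python
# Minimal sketch of VPT-style prompt tuning (assumptions noted above).
import torch
import torch.nn as nn


class PromptedViT(nn.Module):
    def __init__(self, embed_dim=768, num_prompts=10, num_classes=200):
        super().__init__()
        # Stand-in for a pre-trained ViT encoder; frozen during tuning.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the transformer backbone frozen

        # Lightweight prompt module: learnable tokens prepended to the patch
        # sequence. Here they are randomly initialized; VPPT would instead
        # load prompts that were pre-trained before downstream tuning.
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_classes)  # task-specific head

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim), already embedded.
        b = patch_tokens.size(0)
        x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
        x = self.backbone(x)
        return self.head(x[:, 0])  # classify from the first prompt token


model = PromptedViT()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # only prompts + head are updated
```

Only the prompt tokens and the classification head receive gradients, which is what keeps the number of trainable parameters small relative to full fine-tuning.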
