Abstract

Prompts play a crucial role in enhancing the control, adaptability, and scalable application of large language models. In recent years, prompt-based strategies have also been applied to visual models. However, the extent to which fusing multi-modal prompts (e.g., text or image prompts) can improve downstream task performance in visual models has not been systematically investigated. To address this gap, this paper adapts prompt design, inspired by instruction tuning, to a vision transformer for visual tasks; we name the resulting model Instruction-ViT. The key idea is to implement and fuse multi-modal prompts (text or image prompts) carrying category information to guide the fine-tuning of the model. Across experiments on several image understanding tasks, including classification, segmentation, image captioning, and object detection, we observe consistently improved performance and domain adaptability. Our work presents an innovative strategy for fusing multi-modal prompts that enhances the performance and adaptability of visual models.
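
To make the fusion idea concrete, the sketch below shows one plausible way to inject category-related prompt tokens into a ViT encoder and score classes by comparing the fused prompt tokens with the [CLS] representation. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name MultiModalPromptViT, the randomly initialized prompt tokens (standing in for features produced by a text or image prompt encoder), and the cosine-similarity head are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn

class MultiModalPromptViT(nn.Module):
    """Sketch: prepend learnable per-category prompt tokens to the patch
    sequence of a ViT encoder, then score classes by the similarity between
    the [CLS] output and each fused prompt token."""

    def __init__(self, num_classes, dim=768, depth=12, heads=12,
                 image_size=224, patch_size=16):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Hypothetical prompt tokens: in practice these could be initialized
        # from a text encoder (class-name embeddings) or from image exemplars.
        self.prompt_tokens = nn.Parameter(torch.randn(1, num_classes, dim) * 0.02)
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)

    def forward(self, images):
        b = images.size(0)
        x = self.patch_embed(images).flatten(2).transpose(1, 2)      # (B, N, D)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed              # (B, 1+N, D)
        prompts = self.prompt_tokens.expand(b, -1, -1)               # (B, C, D)
        x = torch.cat([prompts, x], dim=1)                           # prompts and patches attend jointly
        x = self.norm(self.encoder(x))
        num_classes = prompts.size(1)
        prompt_out = x[:, :num_classes]                              # fused prompt tokens
        cls_out = x[:, num_classes]                                  # [CLS] token
        # Class logits: cosine similarity between [CLS] and each prompt token.
        logits = torch.einsum("bd,bcd->bc",
                              nn.functional.normalize(cls_out, dim=-1),
                              nn.functional.normalize(prompt_out, dim=-1))
        return logits

# Example usage:
# logits = MultiModalPromptViT(num_classes=10)(torch.randn(2, 3, 224, 224))
```

In a realistic setup, the prompt tokens would presumably be derived from frozen or trainable text/image encoders for each category and trained jointly with the backbone; the similarity head here simply illustrates how fused prompt tokens can serve as class anchors.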
