Abstract

Large language models (LLMs) are believed to contain vast knowledge. Many works have extended LLMs into multimodal models and applied them to a variety of multimodal downstream tasks with a unified model structure driven by prompts. Appropriate prompts can elicit the model's knowledge to solve different tasks. However, how the content of a prompt affects the model's understanding of the input remains under-explored in the literature. We fill this gap by offering a systematic study of prompt probing for multimodal LLMs, examining various factors that influence their understanding of prompts. To this end, we propose a novel prompt probing framework that starts from the input and designs three types of input change strategies as probing templates: visual prompts, text prompts, and extra knowledge prompts. Our extensive experiments on the VQA dataset show that existing multimodal LLMs do not truly understand the input content but instead largely fit the training data distribution. Current multimodal models are still far from understanding prompts properly.
