Abstract

Owing to the strong zero-shot generalization that large language models (LLMs) exhibit on new language tasks, applying LLMs to zero-shot visual question answering (VQA) has become a growing trend. However, most prior approaches directly use off-the-shelf captioning models to generate the captions that compose in-context examples for the LLM; such captions may be uninformative, leading the LLM to incorrect predictions. To address this, we propose zero-shot VQA with feedback from LLMs (ZVQAF), a framework that uses an LLM to assess the quality of generated captions and leverages this feedback to train the captioning model. ZVQAF consists of two stages: in the first, the captioning model is trained with LLM feedback so that it learns the task objective and the information the LLM requires; in the second, the optimized captioning model and the LLM are used together for inference. Extensive experiments show that ZVQAF achieves zero-shot VQA performance comparable to or even better than previous zero-shot, few-shot, and end-to-end trained approaches. For example, on the VQAv2 test set, ZVQAF outperforms the end-to-end trained Flamingo (Alayrac et al., 2022) by a large margin of 8.0%. In addition, on the A-OKVQA dataset, ZVQAF outperforms the zero-shot method Img2LLM (Guo et al., 2023) by 3.8% when using LLMs of similar scale.
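To make the two-stage structure concrete, the minimal Python sketch below illustrates the kind of feedback loop the abstract describes. It is not the authors' implementation: generate_captions, llm_score_caption, finetune_captioner, and llm_answer are hypothetical stubs standing in for the captioning model, the LLM's quality feedback, the captioner update step, and the LLM inference call, respectively.

import random

def generate_captions(image, k=3):
    # Hypothetical captioning model: returns k candidate captions for an image.
    return [f"caption {i} for {image}" for i in range(k)]

def llm_score_caption(question, caption):
    # Hypothetical LLM feedback: rates how informative a caption is
    # for answering the question (score in [0, 1]).
    return random.random()

def finetune_captioner(scored_examples):
    # Placeholder for updating the captioning model using the
    # LLM-provided scores as a training signal.
    pass

def llm_answer(question, caption):
    # Hypothetical LLM inference call: answers the question with the
    # caption supplied as in-context evidence.
    return f"answer conditioned on '{caption}'"

def stage1_train(dataset, epochs=1):
    # Stage 1 (training with feedback): the LLM scores candidate captions,
    # and the scores drive the captioner update.
    for _ in range(epochs):
        batch = []
        for image, question in dataset:
            for caption in generate_captions(image):
                reward = llm_score_caption(question, caption)
                batch.append((image, caption, reward))
        finetune_captioner(batch)

def stage2_infer(image, question):
    # Stage 2 (inference): the optimized captioner supplies a caption,
    # which the LLM uses to answer the question zero-shot.
    best_caption = max(generate_captions(image),
                       key=lambda c: llm_score_caption(question, c))
    return llm_answer(question, best_caption)

if __name__ == "__main__":
    toy_data = [("img_001", "What color is the car?")]
    stage1_train(toy_data)
    print(stage2_infer("img_001", "What color is the car?"))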
