Abstract
Large language models (LLMs) have shown strong zero-shot generalization to new language tasks, and applying them to zero-shot visual question answering (VQA) has become a new trend. However, most prior approaches directly use off-the-shelf captioning models to generate the captions that compose in-context examples for the LLM, and these captions may be uninformative, leading the LLM to make incorrect predictions. To address this, we propose zero-shot VQA with feedback from LLMs (ZVQAF), a framework that uses an LLM to assess the quality of generated captions and leverages this feedback to train the captioning model. ZVQAF consists of two stages: in the first stage, the captioning model is trained with LLM feedback so that it learns the task objective and the information the LLM requires; in the second stage, the optimized captioning model and the LLM are used together for inference. Extensive experiments show that ZVQAF achieves zero-shot VQA performance comparable or even superior to previous zero-shot, few-shot, and end-to-end training approaches. For example, on the VQAv2 test set, ZVQAF outperforms Flamingo (Alayrac et al., 2022), which employs end-to-end training, by a large margin of 8.0%. In addition, on the A-OKVQA dataset, ZVQAF outperforms the zero-shot method Img2LLM (Guo et al., 2023) by 3.8% when using LLMs of similar scale.
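To make the two-stage pipeline concrete, the sketch below illustrates the flow implied by the abstract: an LLM scores candidate captions for how well they support answering the question (Stage 1 feedback for the captioner), and the tuned captioner then supplies captions for zero-shot LLM inference (Stage 2). All functions here (`generate_captions`, `llm_score_caption`, `update_captioner`, `llm_answer`) are hypothetical stubs standing in for the paper's actual models and training objective; this is a minimal sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of the ZVQAF two-stage loop. All components are stubs:
# the real framework uses an actual captioning model and an LLM.

import random
from typing import List


def generate_captions(image_id: str, n: int = 3) -> List[str]:
    """Stand-in for an off-the-shelf captioning model (sampled captions)."""
    return [f"caption_{i} for {image_id}" for i in range(n)]


def llm_score_caption(question: str, caption: str) -> float:
    """Stand-in for LLM feedback: rate how informative a caption is for
    answering the question (here just a deterministic pseudo-random score)."""
    rng = random.Random(hash((question, caption)) % (2 ** 32))
    return rng.random()


def update_captioner(caption: str, reward: float) -> None:
    """Stand-in for the feedback-driven update of the captioning model
    (e.g. a reward-weighted objective); here it only logs the signal."""
    print(f"reward={reward:.2f} -> reinforce caption: {caption!r}")


def llm_answer(question: str, caption: str) -> str:
    """Stand-in for zero-shot VQA: the LLM answers from the caption alone."""
    return f"answer conditioned on {caption!r} for {question!r}"


question = "What color is the bus?"

# Stage 1: training with feedback -- the LLM judges caption quality and the
# score is fed back so the captioner learns what the task requires.
for caption in generate_captions("img_001"):
    update_captioner(caption, llm_score_caption(question, caption))

# Stage 2: inference -- the optimized captioner describes the image and the
# LLM produces the final zero-shot answer.
best_caption = max(generate_captions("img_001"),
                   key=lambda c: llm_score_caption(question, c))
print(llm_answer(question, best_caption))
```

In this reading, the LLM plays two roles: a critic during training (its scores act as the feedback signal) and the answerer at inference time; the exact form of the scoring prompt and the captioner's update rule are details the abstract does not specify.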