Abstract
Recently, zero-shot image captioning (ZSIC) has gained significant attention for its potential to describe unseen objects in images, which is important for real-world applications such as human–computer interaction, intelligent education, and service robots. However, ZSIC methods built on large-scale pretrained models may generate descriptions containing objects that are not present in the image, a phenomenon termed "object hallucination". This occurs because large-scale models tend to predict words or phrases that appeared with high frequency during training. Additionally, these methods impose a fixed limit on description length, which often leads to improperly ended sentences. In this paper, a novel approach is proposed to reduce object hallucination and improper endings in the ZSIC task. We introduce an additional emotion signal as guidance for sentence generation, and we find that an appropriate emotion filters out words that do not correspond to the image. Moreover, we propose a strategy that gradually extends the number of words in a sentence to ensure the generated sentence is properly completed. Experimental results show that the proposed method achieves leading performance on unsupervised metrics. More importantly, qualitative examples illustrate the effectiveness of our method in reducing hallucination and generating properly ended sentences.
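To make the length-extension idea concrete, below is a minimal sketch of one way such a decoding loop could work: the word budget is raised step by step until the caption ends with terminal punctuation, rather than truncating at a fixed length. This is an illustration under assumptions, not the authors' implementation; `generate_caption`, `ends_properly`, and `caption_with_extension` are hypothetical names, and the stub decoder stands in for a real large-scale pretrained model.

```python
# Illustrative sketch of a "gradually extend the word count" decoding loop.
# All function names here are assumptions for demonstration purposes.

def generate_caption(image, max_words: int) -> str:
    """Hypothetical decoder: returns a caption of at most `max_words` words.
    A real system would decode from a large pretrained model instead."""
    words = ["a", "dog", "runs", "happily", "across", "the", "green", "field."]
    return " ".join(words[:max_words])

def ends_properly(caption: str) -> bool:
    # Treat a sentence as complete only if it ends with terminal punctuation.
    return caption.rstrip().endswith((".", "!", "?"))

def caption_with_extension(image, start_words: int = 5,
                           step: int = 1, max_words: int = 20) -> str:
    """Raise the word budget until the caption ends properly, avoiding the
    improper endings caused by a hard, fixed length limit."""
    budget = start_words
    caption = generate_caption(image, budget)
    while not ends_properly(caption) and budget < max_words:
        budget += step
        caption = generate_caption(image, budget)
    return caption

print(caption_with_extension(image=None))
# -> "a dog runs happily across the green field."
```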