State-of-the-art image captioning often relies on supervised models trained on domain-specific image–text pairs, which are expensive and time-consuming to annotate. Moreover, these models suffer performance drops when applied to out-of-distribution data. To overcome these challenges, zero-shot image captioning has emerged, utilizing models trained on unlabeled image and text data. Pre-trained vision-language and generative language models have shown promise in this area, but existing methods do not fully leverage the cross-modal representation capabilities of large-scale models and tend to overlook smaller objects in images. We propose EntroCap, a novel zero-shot image captioning method that incorporates an Entropy-based Retrieval Strategy. During training, our Hierarchical Projector optimizes CLIP’s joint embeddings to capture comprehensive global signals with multi-level representations. During inference, the Entropy-based Retrieval Strategy adjusts prediction logits to better include smaller, less frequent objects. Additionally, our Balancing Gate mechanism guides the language model to produce more detailed and contextually relevant captions by balancing local and global signals. Extensive experiments on MSCOCO, Flickr30k, and NoCaps demonstrate that EntroCap outperforms previous zero-shot approaches in both in-domain and cross-domain settings. We also conduct experiments to quantitatively validate EntroCap’s effectiveness in describing small objects.
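To make the entropy-based logit adjustment concrete, below is a minimal Python sketch of one plausible reading: when the next-token distribution is high-entropy (the model is uncertain), boost the logits of tokens associated with small or infrequent objects. The function name, the `rare_token_ids` set, and the `alpha` hyperparameter are illustrative assumptions and do not reproduce EntroCap's actual formulation.

```python
import torch
import torch.nn.functional as F

def entropy_adjusted_logits(logits, rare_token_ids, alpha=1.0, eps=1e-9):
    """Hypothetical entropy-based logit adjustment (not the paper's exact rule).

    Args:
        logits: (vocab_size,) raw next-token logits from the language model.
        rare_token_ids: indices of tokens describing small / rare objects
            (assumed to come from a retrieval step over CLIP embeddings).
        alpha: boost strength (assumed hyperparameter).
    """
    probs = F.softmax(logits, dim=-1)
    # Shannon entropy of the next-token distribution, normalised to [0, 1].
    entropy = -(probs * (probs + eps).log()).sum()
    max_entropy = torch.log(torch.tensor(float(logits.numel())))
    uncertainty = (entropy / max_entropy).clamp(0.0, 1.0)

    # Larger boost for rare-object tokens when the model is more uncertain.
    adjusted = logits.clone()
    adjusted[rare_token_ids] += alpha * uncertainty
    return adjusted

# Usage: replace the raw logits during greedy or beam decoding, e.g.
# next_logits = entropy_adjusted_logits(lm_logits[-1], rare_ids, alpha=2.0)
```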