Image–text matching is a fundamental task in multimodal research, bridging computer vision and natural language processing by aligning visual content with corresponding textual descriptions. Accurate matching is critical for applications such as image captioning and text-based image retrieval, yet it remains challenging due to the gap between the two data modalities. This paper addresses these challenges by proposing a robust image–text matching model inspired by Contrastive Language–Image Pre-training (CLIP). Our approach employs a Vision Transformer (ViT) as the image encoder and Bidirectional Encoder Representations from Transformers (BERT) as the text encoder, projecting both into a shared embedding space in which semantic similarity is measured. To improve training efficiency, we adopt the LiT-tuning paradigm and use a cosine decay schedule to dynamically adjust the learning rate. We validate our method on two benchmark datasets, WuKong and Flickr30k, where the model achieves superior performance with significant improvements on key evaluation metrics. The results underscore the model's effectiveness in achieving accurate and robust image–text alignment.
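The abstract does not include implementation details; the following is a minimal sketch of the described dual-encoder setup, assuming PyTorch and Hugging Face Transformers. The checkpoint names (google/vit-base-patch16-224-in21k, bert-base-uncased), the projection dimension, and the optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch of a CLIP-style dual encoder with LiT-style tuning and
# a cosine-decay learning-rate schedule. Hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import ViTModel, BertModel

class DualEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Image encoder (ViT) and text encoder (BERT); checkpoints assumed for illustration.
        self.image_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # LiT-style tuning: lock the pre-trained image tower, train the text tower
        # and the projection heads only.
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        self.image_proj = nn.Linear(self.image_encoder.config.hidden_size, embed_dim)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        # Learnable temperature (log-space), initialized as in CLIP (ln(1/0.07)).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, pixel_values, input_ids, attention_mask):
        # Use the [CLS] token of each encoder as the pooled representation.
        img = self.image_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state[:, 0]
        # Project into the shared embedding space and L2-normalize.
        img = F.normalize(self.image_proj(img), dim=-1)
        txt = F.normalize(self.text_proj(txt), dim=-1)
        return img, txt

def contrastive_loss(img, txt, logit_scale):
    # Symmetric InfoNCE loss over the in-batch image-text similarity matrix.
    logits = logit_scale.exp() * img @ txt.t()
    targets = torch.arange(img.size(0), device=img.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = DualEncoder()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.1)
# Cosine decay of the learning rate over training, as described in the abstract.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```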