Abstract

Image-text matching requires jointly modeling visual and textual information, and the foremost challenge is finding the correspondence between the two modalities. Existing methods either train the matching network on pre-extracted visual features or rely on complex Transformer architectures to extract image and text features for matching. In this paper, we design a simple image-text matching network that can be trained end-to-end. We conduct comparative experiments on the Flickr30K and MSCOCO datasets; the results show that our network outperforms recent methods on Flickr30K, and we further analyze its performance on the VQAv2 dataset.
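The abstract does not describe the network in detail, so as a rough illustration of the general end-to-end image-text matching setup it refers to, here is a minimal dual-encoder sketch in PyTorch with a symmetric contrastive loss. All names, dimensions, and the loss choice (DualEncoderMatcher, img_dim=2048, temperature=0.07, InfoNCE) are illustrative assumptions, not the paper's actual method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualEncoderMatcher(nn.Module):
    """Hypothetical minimal matcher: projects image and text features into a
    shared embedding space and scores pairs by cosine similarity."""

    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feats, txt_feats):
        # L2-normalize so the dot product equals cosine similarity.
        img_emb = F.normalize(self.img_proj(img_feats), dim=-1)
        txt_emb = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return img_emb @ txt_emb.t()  # (batch, batch) similarity matrix


def contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE-style loss: matched pairs lie on the diagonal."""
    targets = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


# Toy usage with random tensors standing in for encoder outputs.
model = DualEncoderMatcher()
imgs = torch.randn(8, 2048)   # e.g. image-encoder features
txts = torch.randn(8, 768)    # e.g. text-encoder features
loss = contrastive_loss(model(imgs, txts))
loss.backward()
```

Because the projection layers (and, in a full model, the underlying encoders) receive gradients from the matching loss directly, this kind of pipeline is trainable end-to-end rather than depending on frozen, pre-extracted visual features.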
