Abstract

We propose a novel end-to-end image-text matching approach that accounts for semantic uncertainty (SU-ITM). It addresses the one-to-many semantic diversity inherent in image-text matching, capturing cross-modal associations more comprehensively and improving model robustness. Traditional methods map images and texts to deterministic points in an embedding space and measure cross-modal similarity between those points. However, point-based embeddings cannot capture semantic uncertainty, which introduces a large bias in the matching results. To address this problem, we model the one-to-many associations between images and texts as probability distributions, incorporating the uncertainty information into the final semantic representation of the text. In addition, we optimize the image-text matching loss so that different text features approximate the image features in a distributional manner while preserving the discriminative nature of the semantic representation, effectively reducing matching uncertainty. Notably, our method is trained end to end, without relying on pre-trained object detection branches at any stage of training. Experiments on Flickr30k and MSCOCO validate the effectiveness of our method on the image-text matching task, achieving R@SUM scores of 546.1 on Flickr30k and 545.0 on MSCOCO 1K.
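To make the core idea concrete, the following is a minimal sketch of probabilistic text embedding with a distribution-based matching loss, assuming a Gaussian parameterization with reparameterized sampling and an InfoNCE-style contrastive objective; all module names, dimensions, and loss details here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticTextHead(nn.Module):
    """Illustrative head mapping pooled text features to a Gaussian
    (mean, log-variance) instead of a single point embedding.
    Names and dimensions are assumptions, not the paper's code."""
    def __init__(self, in_dim=768, embed_dim=512):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)
        self.logvar = nn.Linear(in_dim, embed_dim)

    def forward(self, text_feat, n_samples=4):
        mu = self.mu(text_feat)          # (B, D) distribution mean
        logvar = self.logvar(text_feat)  # (B, D) per-dimension uncertainty
        std = torch.exp(0.5 * logvar)
        eps = torch.randn(n_samples, *mu.shape, device=mu.device)
        # (K, B, D) reparameterized draws from the text distribution
        samples = mu.unsqueeze(0) + eps * std.unsqueeze(0)
        return F.normalize(samples, dim=-1), mu, logvar


def distribution_matching_loss(img_emb, text_samples, temperature=0.05):
    """Contrastive loss averaged over the K sampled text embeddings,
    pulling several plausible text representations toward the image."""
    img_emb = F.normalize(img_emb, dim=-1)  # (B, D)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    losses = []
    for k in range(text_samples.size(0)):
        logits = img_emb @ text_samples[k].t() / temperature  # (B, B)
        losses.append(F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
    return torch.stack(losses).mean()


# Toy usage with random features standing in for encoder outputs.
head = ProbabilisticTextHead()
text_feat = torch.randn(8, 768)
img_emb = torch.randn(8, 512)
samples, mu, logvar = head(text_feat)
loss = distribution_matching_loss(img_emb, samples)
print(loss.item())
```

Averaging the contrastive loss over multiple sampled text embeddings is one simple way to let the text distribution, rather than a single point, approximate the image feature while the per-sample softmax keeps the representation discriminative.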
