Abstract
Image-text matching is a crucial aspect of multi-modal intelligence. The main challenge is accurately measuring the relevance between an image and a text using evidence obtained through matching. Previous studies concentrated either on obtaining a well-represented global feature to measure similarity directly, or on investigating complex matching patterns at a local level before aggregating them; little attention has been paid to combining the two. We propose a Globally Guided Confidence Enhancement Network that combines both approaches by obtaining a good global representation to guide fine-grained local interactions. In this process, content that better matches the text from a global perspective is enhanced and represented with confidence scores. Extensive experiments demonstrate that the proposed approach achieves superior performance on the Flickr30K and MSCOCO datasets.