Abstract

Recently, image-text matching has been intensively explored to bridge vision and language. Previous methods model the inter-modality relationship of an image-text pair from a single-view feature. However, it is difficult to capture all the relevant information through a single inter-modality relationship. In this paper, a novel Multi-View Inter-Modality Representation with Progressive Fusion (MIRPF) is developed to explore inter-modality relationships from multi-view features. The multi-view strategy provides more complementary and global semantic clues than single-view approaches. In particular, a multi-view inter-modality representation network is constructed to generate multiple inter-modality representations, which offer diverse views for discovering latent image-text relationships. Furthermore, a progressive fusion module fuses the inter-modality features stepwise, fully exploiting the inherent complementarity between different views. Extensive experiments on Flickr30K and MSCOCO verify the superiority of MIRPF over several existing approaches. The code is available at: https://github.com/jasscia18/MIRPF.
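To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (a) generating multiple inter-modality representations via cross-attention between text words and image regions, and (b) fusing those view-level representations stepwise. All class names, dimensions, and the fusion gate here are hypothetical illustrations under plausible assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
# Minimal sketch: multi-view inter-modality representations + progressive
# fusion. Names (InterModalityView, ProgressiveFusion, MIRPFSketch) are
# hypothetical, not taken from the paper's released code.
import torch
import torch.nn as nn


class InterModalityView(nn.Module):
    """One 'view': text words cross-attend over image regions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Text queries attend to image regions -> inter-modality features.
        out, _ = self.attn(query=txt, key=img, value=img)
        # Pool over words to get one representation per image-text pair.
        return self.norm(out).mean(dim=1)


class ProgressiveFusion(nn.Module):
    """Fuses the K view representations one step at a time."""

    def __init__(self, dim: int, num_views: int):
        super().__init__()
        # One fusion layer per step; each step merges the running fused
        # feature with the next view's feature (a simple concat-project
        # stand-in for whatever fusion the paper actually uses).
        self.steps = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(num_views - 1)
        )

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        fused = views[0]
        for step, view in zip(self.steps, views[1:]):
            fused = torch.relu(step(torch.cat([fused, view], dim=-1)))
        return fused


class MIRPFSketch(nn.Module):
    def __init__(self, dim: int = 256, num_views: int = 3):
        super().__init__()
        self.views = nn.ModuleList(
            InterModalityView(dim) for _ in range(num_views)
        )
        self.fusion = ProgressiveFusion(dim, num_views)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        reps = [view(img, txt) for view in self.views]
        return self.fusion(reps)  # (batch, dim) matching feature


if __name__ == "__main__":
    img = torch.randn(2, 36, 256)  # 36 region features per image
    txt = torch.randn(2, 12, 256)  # 12 word features per caption
    print(MIRPFSketch()(img, txt).shape)  # torch.Size([2, 256])
```

The stepwise loop in ProgressiveFusion is the key structural point: each view is absorbed into the running representation one at a time, so later steps can exploit complementary cues the earlier views missed, rather than averaging all views in a single shot.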
