Abstract

Recently, image-text matching based on local region-word semantic alignment has attracted considerable research attention. This fine-grained interplay can be achieved by aggregating the similarities of aligned region-word pairs. However, most cross-modal matching literature treats the similarities of aligned region-word pairs equally, without considering their respective importance. Moreover, local alignment methods are prone to global semantic drift because they ignore the overall theme of the image-text pair. In this paper, a novel Dual-View Semantic Inference (DVSI) network is proposed to leverage both local and global semantic matching in a holistic deep framework. For the local view, a region enhancement module is proposed to mine the priorities of different regions in the image, enabling the network to differentiate regions when discovering latent region-word relationships. For the global view, the overall semantics of the image are summarized for global semantic matching to avoid global semantic drift. The two views are unified for final image-text matching. Extensive experiments conducted on MSCOCO and Flickr30K demonstrate the effectiveness of the proposed DVSI.
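To make the dual-view idea concrete, the following is a minimal PyTorch sketch of the scoring logic the abstract describes: region-word similarities aggregated with per-region priority weights (local view) plus a summarized image embedding matched against the sentence embedding (global view). The function names (local_score, global_score, dvsi_score), the weighting and aggregation scheme, and the balancing parameter alpha are illustrative assumptions, not the paper's published implementation.

```python
# Hedged sketch of a dual-view matching score; details are assumptions,
# not the DVSI paper's exact formulation.
import torch
import torch.nn.functional as F

def region_word_similarity(regions, words):
    """Cosine similarity between every image region and every caption word.

    regions: (n_regions, d) region features
    words:   (n_words, d)   word features
    returns: (n_regions, n_words) similarity matrix
    """
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    return r @ w.t()

def local_score(regions, words, region_weights):
    """Local view: aggregate region-word similarities, weighting each region
    by its inferred priority instead of treating all pairs equally."""
    sim = region_word_similarity(regions, words)   # (R, W)
    best_per_region = sim.max(dim=1).values        # best-matching word per region
    return (region_weights * best_per_region).sum()

def global_score(regions, sentence_emb):
    """Global view: summarize regions into one image embedding and match it
    against the sentence embedding to curb global semantic drift."""
    image_emb = F.normalize(regions.mean(dim=0), dim=-1)
    return torch.dot(image_emb, F.normalize(sentence_emb, dim=-1))

def dvsi_score(regions, words, region_weights, sentence_emb, alpha=0.5):
    """Fuse the two views; alpha is a hypothetical balancing weight."""
    return alpha * local_score(regions, words, region_weights) + \
        (1.0 - alpha) * global_score(regions, sentence_emb)

# Toy usage with random features (36 regions, 12 words, 1024-d).
R, W, D = 36, 12, 1024
regions = torch.randn(R, D)
words = torch.randn(W, D)
sentence = torch.randn(D)
weights = torch.softmax(torch.randn(R), dim=0)  # stand-in for region enhancement output
print(dvsi_score(regions, words, weights, sentence))
```

In practice, the region weights would come from a learned region enhancement module and the whole score would be trained with a ranking loss over matched and mismatched image-text pairs; the sketch only illustrates how the two views combine at inference time.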
