Recently, image-text matching based on local region-word semantic alignment has attracted considerable research attention. Fine-grained interplay between the two modalities is typically achieved by aggregating the similarities of aligned region-word pairs. However, most cross-modal matching studies treat these pair similarities equally, without considering their respective importance. Moreover, local alignment methods are prone to global semantic drift because they ignore the overall theme of the image-text pair. In this paper, a novel Dual-View Semantic Inference (DVSI) network is proposed to leverage both local and global semantic matching in a holistic deep framework. For the local view, a region enhancement module is proposed to mine the priorities of different regions in the image, providing the discriminative ability needed to discover latent region-word relationships. For the global view, the overall semantics of the image are summarized for global semantic matching to avoid global semantic drift. The two views are unified for final image-text matching. Extensive experiments on MSCOCO and Flickr30K demonstrate the effectiveness of the proposed DVSI.
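The following is a minimal sketch of the dual-view idea described above, not the paper's actual DVSI implementation: the local view weights region-word similarities by assumed region priorities (a softmax stand-in for the region enhancement module), and the global view compares a summarized image embedding with a sentence embedding. Function names, the mean-pooling summary, and the balancing weight `alpha` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def dual_view_similarity(region_feats, word_feats, sent_feat, alpha=0.5):
    """Illustrative dual-view image-text matching score (not the paper's exact method).

    region_feats: (R, D) image region embeddings
    word_feats:   (T, D) word embeddings
    sent_feat:    (D,)   sentence-level embedding
    alpha:        assumed weight balancing the local and global views
    """
    # Normalize embeddings so dot products act as cosine similarities.
    regions = F.normalize(region_feats, dim=-1)
    words = F.normalize(word_feats, dim=-1)
    sent = F.normalize(sent_feat, dim=-1)

    # ---- Local view: region-word alignment ----
    # Region "priorities" are approximated by a softmax over each region's
    # best-matching word similarity (a placeholder for the region
    # enhancement module described in the abstract).
    sim = regions @ words.t()                 # (R, T) region-word similarities
    best_per_region, _ = sim.max(dim=1)       # best-aligned word per region
    region_weights = torch.softmax(best_per_region, dim=0)
    local_score = (region_weights * best_per_region).sum()

    # ---- Global view: summarized image semantics vs. sentence ----
    # A simple mean over regions serves as the image summary here; the
    # paper's summarization scheme may differ.
    global_image = F.normalize(regions.mean(dim=0), dim=-1)
    global_score = global_image @ sent

    # Unify the two views into a single matching score.
    return alpha * local_score + (1 - alpha) * global_score

# Example usage with random features (36 regions, 12 words, 1024-d embeddings).
score = dual_view_similarity(torch.randn(36, 1024), torch.randn(12, 1024), torch.randn(1024))
```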