Abstract

Image-text matching is a challenging task in cross-modal learning due to the discrepancy in data representation between the image and text modalities. Mainstream methods adopt late fusion to generate image-text similarity from encoded cross-modal features, and expend considerable training cost on capturing intra-modality associations. In this work, we propose to Combine Early and Late Fusion Together (CELFT), a universal hybrid fusion framework that effectively overcomes the above shortcomings of the late fusion scheme. In the proposed CELFT framework, the hybrid structure with early fusion and late fusion facilitates interaction between the image and text modalities at an early stage. Moreover, the two fusion strategies complement each other in capturing inter-modal and intra-modal information, which ensures more accurate image-text similarity. In the experiments, we choose four recent late-fusion approaches as base models and integrate them with our CELFT framework. Results on two widely used image-text datasets, MSCOCO and Flickr30K, show that the matching performance of all base models is significantly improved with remarkably reduced training time.
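To make the hybrid idea concrete, below is a minimal PyTorch sketch of combining an early-fusion similarity (cross-modal attention between region and word features) with a late-fusion similarity (cosine similarity of independently pooled embeddings). The module names, pooling choices, and the way the two scores are combined are illustrative assumptions; the abstract does not specify CELFT's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFusionSketch(nn.Module):
    """Hypothetical early+late fusion matcher (not the paper's exact design)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Early-fusion branch: cross-attention lets word tokens interact
        # with image regions before a similarity score is produced.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.early_head = nn.Linear(dim, 1)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) region features; txt_feats: (B, T, D) word features.

        # Early fusion: text queries attend to image regions, then a small
        # head maps the fused representation to a scalar similarity.
        fused, _ = self.cross_attn(txt_feats, img_feats, img_feats)
        early_sim = self.early_head(fused.mean(dim=1)).squeeze(-1)  # (B,)

        # Late fusion: cosine similarity between independently pooled,
        # L2-normalized image and text embeddings.
        img_emb = F.normalize(img_feats.mean(dim=1), dim=-1)
        txt_emb = F.normalize(txt_feats.mean(dim=1), dim=-1)
        late_sim = (img_emb * txt_emb).sum(dim=-1)  # (B,)

        # Combine the two signals (a simple sum; the combination rule
        # used by CELFT is an assumption here).
        return early_sim + late_sim

# Usage example with random features standing in for encoder outputs.
model = HybridFusionSketch(dim=512)
img = torch.randn(4, 36, 512)   # 4 images, 36 regions each
txt = torch.randn(4, 20, 512)   # 4 captions, 20 tokens each
print(model(img, txt).shape)    # torch.Size([4])
```

The intent of the sketch is only to show how the two branches complement each other: the early branch models inter-modal interactions directly, while the late branch keeps the cheap decoupled-encoder similarity.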
