Abstract
Image-text matching is a challenging task in cross-modal learning due to the discrepancy of data representation be-tween different modalities of images and texts. The main-stream methods adopt the late fusion to generate image-text similarity on encoded cross-modal features, and put effort to capture intra-modality associations with considerably high training cost. In this work, we propose to Combine Early and Late Fusion Together (CELFT), which is a universal hybrid fusion framework that can effectively overcome the above shortcomings of the late fusion scheme. In the pro-posed CELFT framework, the hybrid structure with early fusion and late fusion could facilitate the interaction between image and text modalities at early stage. Moreover, these two kinds of fusion strategies complement each other in capturing the inter-modal and intra-modal information, which ensure to learn more accurate image-text similarity. In the experiments, we choose four latest approaches based on the late fusion scheme as the base models, and integrate them with our CELFT framework. The results on two widely used image-text datasets MSCOCO and Flickr30K show that the matching performance of all base models is significantly improved with remarkably reduced training time.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have