Abstract

Image-text matching is a challenging task in cross-modal learning due to the discrepancy in data representation between the image and text modalities. Mainstream methods adopt late fusion to generate image-text similarity from encoded cross-modal features, and expend considerable training cost on capturing intra-modality associations. In this work, we propose to Combine Early and Late Fusion Together (CELFT), a universal hybrid fusion framework that effectively overcomes the above shortcomings of the late fusion scheme. In the proposed CELFT framework, the hybrid structure with early fusion and late fusion facilitates interaction between the image and text modalities at an early stage. Moreover, the two fusion strategies complement each other in capturing inter-modal and intra-modal information, which ensures more accurate image-text similarity. In the experiments, we choose four recent late-fusion approaches as base models and integrate them with our CELFT framework. Results on two widely used image-text datasets, MSCOCO and Flickr30K, show that the matching performance of all base models is significantly improved with remarkably reduced training time.
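To make the hybrid idea concrete, below is a minimal PyTorch sketch of combining an early-fusion similarity (cross-modal attention between region and word features) with a late-fusion similarity (cosine similarity of independently pooled embeddings). The module names, pooling choices, and the way the two scores are combined are illustrative assumptions; the abstract does not specify CELFT's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFusionSketch(nn.Module):
    """Hypothetical early+late fusion matcher (not the paper's exact design)."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Early-fusion branch: cross-attention lets word tokens interact
        # with image regions before a similarity score is produced.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.early_head = nn.Linear(dim, 1)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) region features; txt_feats: (B, T, D) word features.

        # Early fusion: text queries attend to image regions, then a small
        # head maps the fused representation to a scalar similarity.
        fused, _ = self.cross_attn(txt_feats, img_feats, img_feats)
        early_sim = self.early_head(fused.mean(dim=1)).squeeze(-1)  # (B,)

        # Late fusion: cosine similarity between independently pooled,
        # L2-normalized image and text embeddings.
        img_emb = F.normalize(img_feats.mean(dim=1), dim=-1)
        txt_emb = F.normalize(txt_feats.mean(dim=1), dim=-1)
        late_sim = (img_emb * txt_emb).sum(dim=-1)  # (B,)

        # Combine the two signals (a simple sum; the combination rule
        # used by CELFT is an assumption here).
        return early_sim + late_sim

# Usage example with random features standing in for encoder outputs.
model = HybridFusionSketch(dim=512)
img = torch.randn(4, 36, 512)   # 4 images, 36 regions each
txt = torch.randn(4, 20, 512)   # 4 captions, 20 tokens each
print(model(img, txt).shape)    # torch.Size([4])
```

The intent of the sketch is only to show how the two branches complement each other: the early branch models inter-modal interactions directly, while the late branch keeps the cheap decoupled-encoder similarity.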
