Abstract
Image–text retrieval is a challenging task that requires bridging the modality gap between vision and language. Although mainstream late fusion schemes facilitate intramodality correlations, they incur a heavy computational burden and achieve insufficient intermodal alignment. In this work, we propose the comprehensive framework of early and late fusion (CFELF), a universal framework that combines early fusion with late fusion. To enhance cross-modal correspondence, CFELF fuses local visual regions with global sentence features at an early stage and aggregates them in late fusion backbones. Fusion at these two phases of feature processing is therefore complementary: it captures salient intramodality information while encouraging intermodal alignment. We extensively evaluate CFELF on four advanced late fusion backbones and compare it with other early fusion modules. Results on two public image–text datasets demonstrate that the comprehensive fusion framework improves retrieval performance while accelerating convergence.
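The abstract does not specify how the early-stage fusion is implemented, so the following is only a minimal sketch of the general idea it describes: conditioning local visual region features on a global sentence embedding before they enter a late fusion backbone. The `EarlyFusion` module, its gating layer, and all dimensions are hypothetical illustrations, not the paper's actual method.

```python
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Hypothetical sketch: modulate each local visual region by its
    relevance to the global sentence embedding, so the late fusion
    backbone receives text-conditioned region features."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)  # assumed gating layer
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions: torch.Tensor, sentence: torch.Tensor) -> torch.Tensor:
        # regions:  (B, R, D) local region features from a visual encoder
        # sentence: (B, D)    global sentence embedding from a text encoder
        txt = sentence.unsqueeze(1).expand_as(regions)  # broadcast text to every region
        g = torch.sigmoid(self.gate(torch.cat([regions, txt], dim=-1)))
        # Residual, text-conditioned region features for the late fusion backbone.
        return self.norm(regions + g * txt)


# Usage: fused features would replace raw region features in the backbone.
regions = torch.randn(4, 36, 512)   # e.g., 36 detected regions per image
sentence = torch.randn(4, 512)
fused = EarlyFusion(512)(regions, sentence)
print(fused.shape)                  # torch.Size([4, 36, 512])
```

Because the fused features keep the same shape as the raw region features, a module of this kind can, in principle, be dropped in front of any late fusion backbone, which is consistent with the abstract's claim of a universal framework.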