Abstract

Deep metric learning has become a key component of cross-modal retrieval. By learning to pull the features of matched instances closer while pushing the features of mismatched instances farther apart, one can learn highly robust multi-modal representations. Most existing cross-modal retrieval methods leverage the vanilla triplet loss to train the network, which cannot adaptively penalize pairs of different hardness. Although various weighting strategies have been designed for unimodal matching tasks, few have been applied to cross-modal tasks due to the specificity of those tasks. The few weighting strategies that are designed for cross-modal scenarios usually involve many hyper-parameters, which require substantial computational resources to fine-tune. In this paper, we introduce a new exponential loss, which assigns appropriate weights to individual positive and negative pairs according to their similarity, so that it can adaptively penalize pairs of different hardness. Furthermore, the exponential loss has only two hyper-parameters, making it easier to find the optimal parameters for various data distributions in practice. The exponential loss can be applied universally to well-established cross-modal models and further boost their retrieval performance. We exhaustively ablate our method on Image-Text matching, Video-Text matching, and unimodal Image matching. Experimental results show that a standard model trained with the exponential loss achieves noticeable performance gains.
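The abstract does not give the exact formulation of the exponential loss, but the sketch below illustrates, in PyTorch, what an exponentially weighted pairwise loss of this kind might look like: positive and negative pairs are re-weighted exponentially according to their similarity, and only two hyper-parameters (here named alpha and beta, which are illustrative assumptions rather than the paper's notation) control the weighting.

```python
import torch


def exponential_pair_loss(img_emb, txt_emb, alpha=2.0, beta=2.0):
    """Illustrative exponentially weighted pairwise loss for cross-modal retrieval.

    img_emb, txt_emb: (B, D) L2-normalized embeddings of matched image/text pairs,
    where row i of each tensor corresponds to the same instance.
    alpha, beta: the two hyper-parameters controlling how sharply hard positives
    and hard negatives are re-weighted (assumed form, not the paper's exact loss).
    """
    sim = img_emb @ txt_emb.t()                              # (B, B) cosine similarities
    pos = sim.diag()                                         # matched (positive) pairs
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim[mask].view(sim.size(0), -1)                    # mismatched (negative) pairs

    # Exponential weighting: hard positives (low similarity) and hard negatives
    # (high similarity) receive exponentially larger penalties via the softplus terms.
    pos_term = torch.log1p(torch.exp(-alpha * pos)).mean()
    neg_term = torch.log1p(torch.exp(beta * neg)).mean()
    return pos_term + neg_term


# Example usage on random normalized embeddings:
img = torch.nn.functional.normalize(torch.randn(32, 512), dim=-1)
txt = torch.nn.functional.normalize(torch.randn(32, 512), dim=-1)
loss = exponential_pair_loss(img, txt)
```

Because the gradient of each softplus term grows exponentially with pair hardness, harder pairs dominate the update without any explicit mining step; this is the adaptive-penalization behavior the abstract attributes to the exponential loss, though the concrete weighting function here is only a plausible stand-in.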
