Abstract

Deep metric learning has become a key component of cross-modal retrieval. By learning to pull the features of matched instances closer while pushing the features of mismatched instances farther apart, one can learn highly robust multi-modal representations. Most existing cross-modal retrieval methods train the network with the vanilla triplet loss, which cannot adaptively penalize pairs of differing hardness. Although various weighting strategies have been designed for unimodal matching tasks, few have been applied to cross-modal tasks because of the specific characteristics of those tasks. The few weighting strategies that are designed for cross-modal scenarios usually involve many hyper-parameters, which require substantial computational resources to tune. In this paper, we introduce a new exponential loss, which assigns an appropriate weight to each positive and negative pair according to its similarity, so that pairs of differing hardness are penalized adaptively. Furthermore, the exponential loss has only two hyper-parameters, making it easier in practice to find the optimal settings for various data distributions. The exponential loss can be applied universally to well-established cross-modal models and further boosts their retrieval performance. We exhaustively ablate our method on Image-Text matching, Video-Text matching, and unimodal Image matching. Experimental results show that a standard model trained with the exponential loss achieves noticeable performance gains.
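
The abstract does not give the exact formulation of the loss, so the following is only a minimal sketch of how an exponentially weighted pair loss of this kind might look, assuming a log-sum-exp form with a scale alpha and a margin delta standing in for the two hyper-parameters; the function and argument names are illustrative and not taken from the paper.

    import torch
    import torch.nn.functional as F

    def exponential_loss(sim, alpha=2.0, delta=0.5):
        """Hypothetical exponentially weighted matching loss.

        sim:   (N, N) image-text similarity matrix; diagonal entries are
               matched (positive) pairs, off-diagonal entries are
               mismatched (negative) pairs.
        alpha: scale of the exponential weighting (assumed hyper-parameter).
        delta: similarity margin (assumed hyper-parameter).
        """
        n = sim.size(0)
        pos = sim.diag()                                  # positive similarities
        eye = torch.eye(n, dtype=torch.bool, device=sim.device)
        neg = sim.masked_fill(eye, float('-inf'))         # keep only negatives

        # Smooth hinge on negatives: log(1 + sum_j exp(alpha * (s_j - delta))) / alpha.
        # Its gradient w.r.t. each negative similarity is proportional to
        # exp(alpha * (s_j - delta)), so harder (more similar) negatives
        # receive exponentially larger weight.
        loss_neg = F.softplus(torch.logsumexp(alpha * (neg - delta), dim=1)) / alpha

        # Smooth hinge on positives: log(1 + exp(-alpha * (s_i - delta))) / alpha,
        # which weights harder (less similar) positives more heavily.
        loss_pos = F.softplus(-alpha * (pos - delta)) / alpha

        return (loss_neg + loss_pos).mean()

As a usage sketch, sim would typically be the cosine-similarity matrix between a batch of image embeddings and the corresponding text embeddings, and the two scalars alpha and delta would be tuned on a validation split.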
