Abstract

Deep metric learning has become a key component of cross-modal retrieval. By learning to pull the features of matched instances closer while pushing the features of mismatched instances farther apart, one can learn highly robust multi-modal representations. Most existing cross-modal retrieval methods train the network with the vanilla triplet loss, which cannot adaptively penalize pairs of differing hardness. Although various weighting strategies have been designed for unimodal matching tasks, few have been applied to cross-modal tasks because of the specific characteristics of those tasks. The few weighting strategies that are designed for cross-modal scenarios usually involve many hyper-parameters, which require substantial computational resources to tune. In this paper, we introduce a new exponential loss, which assigns an appropriate weight to each positive and negative pair according to its similarity, so that pairs of differing hardness are penalized adaptively. Furthermore, the exponential loss has only two hyper-parameters, making it easier in practice to find the optimal settings for various data distributions. The exponential loss can be applied universally to well-established cross-modal models and further boosts their retrieval performance. We exhaustively ablate our method on Image-Text matching, Video-Text matching, and unimodal Image matching. Experimental results show that a standard model trained with the exponential loss achieves noticeable performance gains.
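
The abstract does not give the exact formulation of the loss, so the following is only a minimal sketch of how an exponentially weighted pair loss of this kind might look, assuming a log-sum-exp form with a scale alpha and a margin delta standing in for the two hyper-parameters; the function and argument names are illustrative and not taken from the paper.

    import torch
    import torch.nn.functional as F

    def exponential_loss(sim, alpha=2.0, delta=0.5):
        """Hypothetical exponentially weighted matching loss.

        sim:   (N, N) image-text similarity matrix; diagonal entries are
               matched (positive) pairs, off-diagonal entries are
               mismatched (negative) pairs.
        alpha: scale of the exponential weighting (assumed hyper-parameter).
        delta: similarity margin (assumed hyper-parameter).
        """
        n = sim.size(0)
        pos = sim.diag()                                  # positive similarities
        eye = torch.eye(n, dtype=torch.bool, device=sim.device)
        neg = sim.masked_fill(eye, float('-inf'))         # keep only negatives

        # Smooth hinge on negatives: log(1 + sum_j exp(alpha * (s_j - delta))) / alpha.
        # Its gradient w.r.t. each negative similarity is proportional to
        # exp(alpha * (s_j - delta)), so harder (more similar) negatives
        # receive exponentially larger weight.
        loss_neg = F.softplus(torch.logsumexp(alpha * (neg - delta), dim=1)) / alpha

        # Smooth hinge on positives: log(1 + exp(-alpha * (s_i - delta))) / alpha,
        # which weights harder (less similar) positives more heavily.
        loss_pos = F.softplus(-alpha * (pos - delta)) / alpha

        return (loss_neg + loss_pos).mean()

As a usage sketch, sim would typically be the cosine-similarity matrix between a batch of image embeddings and the corresponding text embeddings, and the two scalars alpha and delta would be tuned on a validation split.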
