Abstract
Composed query image retrieval task aims to retrieve the target image in the database by a query that composes two different modalities: a reference image and a sentence declaring that some details of the reference image need to be modified and replaced by new elements. Tackling this task needs to learn a multimodal embedding space, which can make semantically similar targets and queries close but dissimilar targets and queries as far away as possible. Most of the existing methods start from the perspective of model structure and design some clever interactive modules to promote the better fusion and embedding of different modalities. However, their learning objectives use conventional query-level examples as negatives while neglecting the composed query's multimodal characteristics, leading to the inadequate utilization of the training data and suboptimal construction of metric space. To this end, in this paper, we propose to improve the learning objective by constructing and mining hard negative examples from the perspective of multimodal fusion. Specifically, we compose the reference image and its logically unpaired sentences rather than paired ones to create component-level negative examples to better use data and enhance the optimization of metric space. In addition, we further propose a new sentence augmentation method to generate more indistinguishable multimodal negative examples from the element level and help the model learn a better metric space. Massive comparison experiments on four real-world datasets confirm the effectiveness of the proposed method.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
More From: IEEE transactions on image processing : a publication of the IEEE Signal Processing Society
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.