Abstract

Image retrieval with text feedback is an emerging research topic whose goal is to integrate inputs from multiple modalities into a single query. In this setting, a query consists of a reference image plus text feedback describing the modifications between that image and the desired image. Existing work on this task mainly focuses on designing new fusion networks to compose the image and text, but little research addresses the modality gap caused by the inconsistent feature distributions of the different modalities, which strongly affects both feature fusion and similarity learning between queries and the desired image. We propose a Distribution-Aligned Text-based Image Retrieval (DATIR) model, consisting of attention mutual information maximization and hierarchical mutual information maximization, which bridges this gap by increasing the non-linear statistical dependencies between representations of different modalities. Specifically, attention mutual information maximization narrows the gap between the input modalities by maximizing mutual information between the text representation and a semantically consistent representation captured from the reference image and the desired image by a difference transformer. Hierarchical mutual information maximization aligns the feature distributions of the image modality and the fusion modality by estimating mutual information between a single-layer representation in the fusion network and multi-level representations in the desired-image encoder. Extensive experiments on three large-scale benchmark datasets demonstrate that DATIR bridges the modality gap and achieves state-of-the-art retrieval performance.
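The abstract does not specify which mutual information estimator DATIR uses, but MI maximization between paired representations from different modalities is commonly implemented with a contrastive InfoNCE-style lower bound. The following is a minimal, self-contained sketch of such a bound (the function name, batch construction, and temperature value are illustrative assumptions, not details from the paper):

```python
import numpy as np

def infonce_lower_bound(x, y, temperature=0.1):
    """InfoNCE lower bound on mutual information between paired representations.

    x, y: (N, d) arrays of embeddings from two modalities; row i of x is
    paired with row i of y (positive pair), and all other rows in the batch
    serve as negatives. Returns a lower bound on I(X; Y) in nats, which is
    capped at log(N) by construction. Details here are illustrative, not
    taken from the DATIR paper.
    """
    # Cosine similarities between all cross-modal pairs in the batch.
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    logits = (x @ y.T) / temperature                      # (N, N)
    # Row-wise log-softmax: log-probability of each candidate pairing.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average log-probability of the correct (diagonal) pairing, plus log N.
    return float(np.mean(np.diag(log_probs)) + np.log(len(x)))
```

Maximizing this bound (e.g., by gradient ascent on the encoders producing `x` and `y`) pulls semantically matched cross-modal representations together while pushing mismatched ones apart, which is one standard way to increase the statistical dependence the abstract describes. Well-aligned batches yield a bound close to log(N); unrelated batches yield a bound near zero.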
