Abstract

Recently, a particular type of image-text retrieval task named Ternary Image-Text Retrieval (TITR) has drawn increasing attention. In this task, the query and target together comprise three inputs, rather than the two inputs of the widely studied binary retrieval setting. Typical TITR scenarios include recipe retrieval (e.g., ingredient text, instruction text, and food images) and fashion search (e.g., original images, text, and modified images). Recently proposed TITR methods mainly learn the semantic correlations between two modalities by projecting them into a shared embedding space to capture the alignment between images and text. Nevertheless, two limitations remain in these methods: 1) the underlying differences between data of the same modality (e.g., ingredients and instructions) are neglected; and 2) the trilinear interaction among the three inputs is captured only implicitly. To this end, we propose a novel fusion framework named the Trilinear FUsion Network (TFUN), which exploits high-level associations among the three inputs simultaneously and explicitly learns an accurate cross-modal similarity function via a bi-directional triplet loss; the framework is generic for the TITR task. To reduce model complexity, we introduce tensor decomposition to ensure computational efficiency and accessibility. We also develop a three-stage hard-triplet sampling scheme to ensure fast convergence. Extensive experiments on three large-scale TITR datasets, Recipe1M, Fashion200k, and FashionIQ, demonstrate the superiority of the proposed TFUN model over state-of-the-art approaches. The implementation code and additional instructions are provided at https://github.com/CFM-MSG/Code_TFUN.
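
To make the two core ideas concrete, the sketch below illustrates, in PyTorch-style pseudocode, how a rank-factorized trilinear interaction over three inputs and a bi-directional triplet loss can be wired together. This is a minimal illustration under assumed dimensions, rank, and margin values, not the authors' released implementation (which is available at the repository linked above); the class and function names here are hypothetical.

```python
# Minimal sketch (assumptions: layer sizes, rank R, and margin are illustrative;
# this is not the official TFUN code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankTrilinearFusion(nn.Module):
    """Fuses three input embeddings with a rank-R factorized trilinear interaction."""
    def __init__(self, dim_a, dim_b, dim_c, rank=64, out_dim=128):
        super().__init__()
        # Each input is projected to a shared rank-R space; the full trilinear
        # tensor is never materialized (CP-style factorization keeps cost low).
        self.proj_a = nn.Linear(dim_a, rank)
        self.proj_b = nn.Linear(dim_b, rank)
        self.proj_c = nn.Linear(dim_c, rank)
        self.out = nn.Linear(rank, out_dim)

    def forward(self, a, b, c):
        # Elementwise product in the rank space approximates the trilinear
        # interaction among the three inputs.
        fused = self.proj_a(a) * self.proj_b(b) * self.proj_c(c)
        return F.normalize(self.out(fused), dim=-1)

def bidirectional_triplet_loss(query_emb, target_emb, margin=0.3):
    """Hinge triplet loss applied in both retrieval directions.

    `query_emb` and `target_emb` hold L2-normalized embeddings of matched pairs;
    negatives are drawn from the other items in the batch.
    """
    sim = query_emb @ target_emb.t()                  # cosine similarity matrix
    pos = sim.diag().unsqueeze(1)                     # similarities of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # query -> target and target -> query hinge terms
    loss_q2t = F.relu(margin + sim - pos).masked_fill(mask, 0).mean()
    loss_t2q = F.relu(margin + sim.t() - pos).masked_fill(mask, 0).mean()
    return loss_q2t + loss_t2q
```

In a recipe-retrieval setting, for example, `a`, `b`, and `c` could stand for ingredient text, instruction text, and image features, with the fused output compared against target embeddings through the loss above; hard-negative mining would further restrict the in-batch negatives to the most confusing ones.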
