Abstract

The task of cross-modal image retrieval has recently attracted considerable research attention. In real-world scenarios, keyword-based queries issued by users are usually short and have broad semantics. Semantic diversity is therefore as important as retrieval accuracy in such user-oriented services, since diverse results improve the user experience. However, most typical cross-modal image retrieval methods rely on a single-point query embedding and inevitably yield low semantic diversity, while existing diverse retrieval approaches frequently suffer from low accuracy due to a lack of cross-modal understanding. To address this challenge, we introduce an end-to-end solution termed variational multiple instance graph (VMIG), in which a continuous semantic space is learned to capture diverse query semantics, and the retrieval task is formulated as a multiple instance learning problem to connect diverse features across modalities. Specifically, a query-guided variational autoencoder is employed to model the continuous semantic space instead of learning a single-point embedding. Afterward, multiple instances of the image and query are obtained by sampling in the continuous semantic space and applying multihead attention, respectively. Thereafter, an instance graph is constructed to remove noisy instances and align cross-modal semantics. Finally, heterogeneous modalities are robustly fused under multiple losses. Extensive experiments on two real-world datasets verify the effectiveness of the proposed solution in terms of both retrieval accuracy and semantic diversity.
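
To make the idea of drawing multiple instances from a continuous semantic space more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the space is a diagonal Gaussian described by a mean and a log-variance vector, as is common for variational autoencoders, and uses the standard reparameterization z = mu + sigma * eps. The function name `sample_instances`, the vector values, and the number of instances are all hypothetical and chosen only for illustration.

```python
import math
import random


def sample_instances(mu, log_var, k, rng):
    """Draw k instances z = mu + sigma * eps, with eps ~ N(0, I).

    mu and log_var are assumed to be equal-length lists of floats
    describing a diagonal Gaussian (an assumption of this sketch).
    """
    sigma = [math.exp(0.5 * lv) for lv in log_var]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        for _ in range(k)
    ]


rng = random.Random(0)          # fixed seed so the sketch is repeatable
mu = [0.5, -1.0, 2.0]           # hypothetical mean vector
log_var = [-2.0, -2.0, -2.0]    # hypothetical log-variance vector
instances = sample_instances(mu, log_var, k=4, rng=rng)
```

Each call returns several distinct points scattered around the mean instead of one fixed embedding, which is the property the abstract contrasts with single-point query embeddings. How the sampled instances are filtered by the instance graph and fused under multiple losses is not specified in the abstract and is not modeled here.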
