Abstract

Cross-modal retrieval technology helps people quickly retrieve corresponding information between cooking recipes and food images. Both image embeddings and recipe embeddings consist of multiple representation subspaces, and we argue that multiple aspects of a recipe correspond to multiple regions of the food image. Making full use of the implicit connections between these subspaces of recipes and images to improve cross-modal retrieval quality is challenging. In this paper, we propose a multi-subspace implicit alignment cross-modal retrieval framework for recipes and images. Our framework learns multi-subspace information about cooking recipes and food images with multi-head attention networks; implicit alignment at the subspace level narrows the semantic gap between recipe embeddings and food image embeddings; and triplet loss and adversarial loss are combined to guide cross-modal learning. Experimental results show that our framework significantly outperforms state-of-the-art methods in terms of MedR and R@K on Recipe 1M.
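To illustrate the kind of architecture the abstract describes, the following is a minimal sketch, not the authors' implementation: multi-head attention produces subspace-aware embeddings for each modality, and a triplet loss is combined with an adversarial loss that encourages the two modalities to be indistinguishable. All module names, dimensions, pooling choices, and loss weights here are assumptions for illustration only.

```python
# Hypothetical sketch of multi-subspace embedding with combined triplet +
# adversarial losses; details differ from the paper's actual framework.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceEncoder(nn.Module):
    """Projects modality features into multiple representation subspaces
    via multi-head self-attention, then pools them into one embedding."""
    def __init__(self, feat_dim=1024, embed_dim=1024, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):                          # x: (batch, seq_len, feat_dim)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)                  # each head models one subspace
        return F.normalize(h.mean(dim=1), dim=-1)  # (batch, embed_dim)

class ModalityDiscriminator(nn.Module):
    """Scores whether an embedding came from the recipe or the image encoder."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z):
        return self.net(z).squeeze(-1)

def retrieval_loss(img_emb, rec_emb, disc, margin=0.3, adv_weight=0.05):
    """Triplet loss on matched pairs plus an adversarial term that rewards
    embeddings the discriminator cannot tell apart (weights are assumed)."""
    # In-batch negatives: roll the recipe embeddings by one position.
    neg_rec = rec_emb.roll(shifts=1, dims=0)
    triplet = F.triplet_margin_loss(img_emb, rec_emb, neg_rec, margin=margin)
    # Encoder-side adversarial loss: push both modalities toward "undecidable".
    logits = disc(torch.cat([img_emb, rec_emb], dim=0))
    target = torch.full_like(logits, 0.5)
    adv = F.binary_cross_entropy_with_logits(logits, target)
    return triplet + adv_weight * adv

# Usage with random features standing in for real recipe / image inputs.
img_feats = torch.randn(4, 49, 1024)   # e.g. 7x7 CNN region features
rec_feats = torch.randn(4, 20, 1024)   # e.g. encoded recipe sentences
img_enc, rec_enc = SubspaceEncoder(), SubspaceEncoder()
disc = ModalityDiscriminator()
loss = retrieval_loss(img_enc(img_feats), rec_enc(rec_feats), disc)
loss.backward()
```

In a full training loop the discriminator would be updated separately with true modality labels, while the encoders are updated with the combined loss above; the single-step version here only shows how the pieces fit together.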
