Abstract

Semantic embedding learning for images and text has been well studied in recent years. In this paper, we present a simple yet effective dual-encoder framework (an image encoder and a text encoder) that unifies images and text in a common embedding space. Inspired by deep metric learning, we use a triplet ranking loss to minimize the gap between the two embedding spaces. We train and test the proposed framework on the Flickr8k, Flickr30k and MS-COCO datasets, and evaluate it on the Corel1k benchmark dataset as a downstream application. Using VGG-19 as the image encoder, a GRU as the text encoder, and the triplet ranking loss, we obtain clear improvements over the baseline model on image annotation and image search tasks. Additionally, we explore arithmetic operations that combine vectors produced by our image encoder with plain word embeddings. These experiments demonstrate the effectiveness of the proposed learning framework.
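For concreteness, the sketch below shows one common form of the bidirectional triplet ranking loss used to align dual-encoder embeddings. It is a minimal illustration under our own assumptions, not the paper's exact implementation: cosine similarity over L2-normalized embeddings, in-batch negatives, the margin value of 0.2, and the PyTorch framing are all choices made here for the example.

```python
import torch

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Bidirectional hinge-based triplet ranking loss over a batch.

    img_emb, txt_emb: (B, D) L2-normalized embeddings, where row i of
    each tensor is a matching image-caption pair; all other rows in the
    batch serve as negatives (an in-batch sampling assumption).
    """
    # Cosine similarity matrix: scores[i, j] = s(image_i, caption_j)
    scores = img_emb @ txt_emb.t()
    diagonal = scores.diag().view(-1, 1)

    # Image-to-text: each matching caption should outscore other captions
    cost_txt = (margin + scores - diagonal).clamp(min=0)
    # Text-to-image: each matching image should outscore other images
    cost_img = (margin + scores - diagonal.t()).clamp(min=0)

    # Zero out the diagonal, i.e. the positive pairs themselves
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    return cost_txt.sum() + cost_img.sum()
```

Summing both directions encourages the two embedding spaces to agree symmetrically, which matches the abstract's use of the loss for both image annotation (image-to-text) and image search (text-to-image).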
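The arithmetic experiments can be pictured as analogy-style queries in the shared space, in the spirit of word2vec analogies. The snippet below is purely illustrative: the "blue car" example, the encoder outputs, and the retrieval gallery are hypothetical stand-ins for the trained components, and normalize-then-dot retrieval is an assumption.

```python
import torch
import torch.nn.functional as F

def analogy_query(img_vec, minus_word, plus_word, gallery, k=5):
    """Analogy-style retrieval in a shared image-text space:
    img_vec - minus_word + plus_word, then k nearest neighbors
    by cosine similarity over an (N, D) gallery of embeddings."""
    query = F.normalize(img_vec - minus_word + plus_word, dim=0)
    gallery = F.normalize(gallery, dim=1)
    sims = gallery @ query            # cosine similarity to each gallery item
    return sims.topk(k).indices      # indices of the k closest items

# Toy usage with random stand-ins for real embeddings (hypothetical):
d, n = 512, 1000
blue_car = torch.randn(d)                    # image encoder output
blue, red = torch.randn(d), torch.randn(d)   # plain word embeddings
gallery = torch.randn(n, d)                  # embeddings of a retrieval set
print(analogy_query(blue_car, blue, red, gallery))
```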
