Abstract

In this paper, we propose to boost cross-modal retrieval by mutually aligning images and captions in terms of both features and relationships. First, we propose a multi-feature based visual-semantic embedding (MVSE++) space for retrieving candidates in the other modality, which provides a more comprehensive representation of the visual content of objects and the scene context in images. This gives us a better chance of finding an accurate and detailed caption for an image. However, a caption condenses the image content into a semantic description, so the cross-modal neighboring relationships starting from the visual side and from the semantic side are asymmetric. To retrieve better cross-modal neighbors, we propose to re-rank the initially retrieved candidates according to the $k$ nearest reciprocal neighbors in the MVSE++ space. The method is evaluated on the benchmark MSCOCO and Flickr30K datasets with standard metrics. We achieve higher accuracy in caption retrieval and image retrieval at both R@1 and R@10.
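As a concrete illustration of the re-ranking step, below is a minimal sketch of cross-modal k-reciprocal re-ranking for a single image query. It assumes cosine similarity in the joint embedding space and pre-computed image and caption embeddings (the array names img_emb and cap_emb are placeholders); the simple two-group ordering rule is for illustration only and is not necessarily the exact criterion used in the paper.

```python
import numpy as np

def topk_cosine(query, gallery, k):
    """Indices of the k most similar gallery rows to the query (cosine similarity)."""
    sims = gallery @ query / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(query) + 1e-12)
    return np.argsort(-sims)[:k]

def k_reciprocal_rerank(query_idx, img_emb, cap_emb, k=10):
    """Re-rank the caption candidates of image `query_idx`: a candidate is placed in the
    front group only if the query image is also among that caption's k nearest images."""
    candidates = topk_cosine(img_emb[query_idx], cap_emb, k)
    reciprocal, others = [], []
    for c in candidates:
        back = topk_cosine(cap_emb[c], img_emb, k)  # neighbors from the caption back to images
        (reciprocal if query_idx in back else others).append(c)
    # reciprocal neighbors first; each group keeps its original similarity order
    return reciprocal + others
```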

Highlights

  • The task of image-caption retrieval aims at finding corresponding sentences given an image query or retrieving images with a sentence query

  • To retrieve a better cross-modal neighbor, we propose to re-rank the initially retrieved candidates according to the k nearest reciprocal neighbors in MVSE++ space

  • We show that the multi-feature representation, which avoids visual-semantic misalignment with respect to both objects and scene context, is more representative for image-caption retrieval than the single features used previously (see the sketch after this list)
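As a rough illustration of what a multi-feature image encoder might look like, the sketch below concatenates object-centric features (e.g., from an ImageNet-trained CNN) with scene-context features (e.g., from a Places-trained CNN) and projects them into the joint space. The feature dimensions and the single linear projection are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiFeatureImageEncoder(nn.Module):
    """Illustrative multi-feature image encoder: object features and scene-context
    features are fused and projected into one joint embedding space.
    Dimensions are placeholders, not the paper's configuration."""
    def __init__(self, obj_dim=2048, scene_dim=2048, joint_dim=1024):
        super().__init__()
        self.fc = nn.Linear(obj_dim + scene_dim, joint_dim)

    def forward(self, obj_feat, scene_feat):
        fused = torch.cat([obj_feat, scene_feat], dim=1)   # concatenate the two feature views
        return F.normalize(self.fc(fused), dim=1)          # L2-normalize for cosine similarity
```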


Summary

INTRODUCTION

The task of image-caption retrieval aims at finding corresponding sentences given an image query, or retrieving images with a sentence query. VSE [16] embeds deep visual features and deep semantic features into a cross-modal space trained with a bi-directional ranking loss. We propose to retrieve candidates in the other modality in a multi-feature based VSE++ (MVSE++) space. This multiple-visual-feature embedding provides a more comprehensive representation of the visual content of objects and the scene context in images, and therefore yields a better initial retrieval. To the best of our knowledge, we achieve the highest accuracy in caption retrieval and image retrieval at both R@1 and R@10.
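For reference, a common form of the bi-directional ranking loss in VSE++-style models is a hinge loss over both retrieval directions using the hardest negative in the batch; the sketch below follows that standard formulation. The margin value and the hardest-negative choice are shown as assumptions about how the loss is instantiated, not as the paper's exact settings.

```python
import torch

def bidirectional_ranking_loss(img_emb, cap_emb, margin=0.2):
    """Hinge-based bi-directional ranking loss over a batch of matched
    (image, caption) pairs, using the hardest in-batch negative (VSE++ style)."""
    scores = img_emb @ cap_emb.t()                  # similarities; embeddings assumed L2-normalized
    pos = scores.diag().view(-1, 1)                 # similarity of each matched pair
    cost_cap = (margin + scores - pos).clamp(min=0)      # caption retrieval: negative captions per image
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # image retrieval: negative images per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)        # do not penalize the matched pair itself
    cost_img = cost_img.masked_fill(mask, 0)
    # hardest negative per image (rows) and per caption (columns)
    return cost_cap.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()
```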

RELATED WORK
THE CROSS-MODAL K-RECIPROCAL NEAREST NEIGHBOR BASED RE-RANKING
CONCLUSION