Abstract

Image and sentence matching has attracted much attention recently, and many effective methods have been proposed for it. However, even current state-of-the-art methods still cannot reliably associate challenging pairs of images and sentences whose regions and words contain few-shot content. Such few-shot matching is seldom studied and has become a bottleneck for further performance improvement in real-world applications. In this work, we formulate this challenging problem as few-shot image and sentence matching, and propose an Aligned Cross-Modal Memory (ACMM) model to address it. The model not only softly aligns few-shot regions and words in a weakly-supervised manner, but also persistently stores and updates cross-modal prototypical representations of few-shot classes as references, without using any ground-truth region-word correspondence. It can also adaptively balance the relative importance of few-shot and common content in an image and sentence, leading to a better measure of overall similarity. We perform extensive experiments on both few-shot and conventional image and sentence matching, and demonstrate the effectiveness of the proposed model by achieving state-of-the-art results on two public benchmark datasets.
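
The abstract does not give implementation details, but the three ingredients it names (weakly-supervised soft alignment of regions and words, a persistent cross-modal memory of few-shot prototypes, and an adaptive balance between few-shot and common content) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the cosine-attention alignment, the moving-average memory update, the sigmoid gate, and all shapes and names are hypothetical, not the authors' ACMM implementation.

```python
# Hypothetical sketch of the mechanisms described in the abstract:
# (1) soft region-word alignment via cross-modal attention,
# (2) a memory of few-shot visual/textual prototypes updated by moving average,
# (3) an adaptive gate balancing few-shot and common similarity.
import torch
import torch.nn.functional as F


class CrossModalMemorySketch(torch.nn.Module):
    def __init__(self, dim=512, num_slots=100, momentum=0.9):
        super().__init__()
        # One visual and one textual prototype per few-shot class (memory slot).
        self.register_buffer("visual_mem", torch.zeros(num_slots, dim))
        self.register_buffer("text_mem", torch.zeros(num_slots, dim))
        self.momentum = momentum
        # Scalar gate trading off few-shot vs. common content (assumed form).
        self.gate = torch.nn.Linear(2 * dim, 1)

    def soft_align(self, regions, words):
        # regions: (R, dim), words: (W, dim); cosine attention gives a soft,
        # weakly-supervised alignment without region-word labels.
        r = F.normalize(regions, dim=-1)
        w = F.normalize(words, dim=-1)
        attn = F.softmax(r @ w.t() / 0.1, dim=-1)   # (R, W)
        attended_words = attn @ w                   # word context per region
        common_sim = (r * attended_words).sum(-1).mean()
        return common_sim, r, w

    @torch.no_grad()
    def update_memory(self, slot_ids, r_feats, w_feats):
        # Moving-average update of the stored cross-modal prototypes (assumed rule).
        m = self.momentum
        self.visual_mem[slot_ids] = m * self.visual_mem[slot_ids] + (1 - m) * r_feats
        self.text_mem[slot_ids] = m * self.text_mem[slot_ids] + (1 - m) * w_feats

    def forward(self, regions, words):
        common_sim, r, w = self.soft_align(regions, words)
        # Few-shot similarity: compare inputs against the stored prototypes.
        vm = F.normalize(self.visual_mem, dim=-1)
        tm = F.normalize(self.text_mem, dim=-1)
        few_shot_sim = ((r @ vm.t()).max(-1).values.mean()
                        + (w @ tm.t()).max(-1).values.mean()) / 2
        # Adaptive balance between few-shot and common content.
        g = torch.sigmoid(self.gate(torch.cat([r.mean(0), w.mean(0)])))
        return g * few_shot_sim + (1 - g) * common_sim


# Toy usage: 5 region features and 7 word features of dimension 512.
model = CrossModalMemorySketch()
score = model(torch.randn(5, 512), torch.randn(7, 512))
print(score.item())
```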
