Abstract

In this paper, we investigate Fine-grained Sketch-based Image Retrieval (FG-SBIR), which uses hand-drawn sketches as queries to retrieve the relevant images at the fine-grained instance level. Because sketches and images come from different modalities, similarity computation must account for both fine-grained and cross-modal characteristics. Existing solutions focus only on fine-grained details or spatial context, while ignoring channel context and spatial sequence information. To address these problems, we propose a novel deep FG-SBIR model, AE-Net, which infers attention maps along the channel and spatial dimensions, improves the channel-attention and spatial-attention modules, and exploits a Transformer to strengthen the model's ability to construct and understand spatial sequence information. We attend not only to the correlation between the sketch and image modalities, but also to the discrimination information within each single modality. In particular, a Mutual Loss is proposed to complement the traditional triplet loss and improve the model's discrimination ability within a single modality. Extensive experiments show that AE-Net achieves promising results on Sketchy, currently the largest public dataset available for FG-SBIR.
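To make the attention and loss components referred to above concrete, the following is a minimal illustrative sketch in PyTorch of generic channel-attention and spatial-attention blocks together with a standard triplet loss. The specific layer sizes, pooling choices, margin value, and module names are assumptions for illustration only, not the actual AE-Net design, and the proposed Mutual Loss is not reproduced here.

```python
# Hedged sketch: generic channel/spatial attention and a standard triplet loss.
# These are NOT the paper's AE-Net modules; all hyperparameters below are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Re-weights feature channels using pooled channel descriptors."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); summarize each channel by spatial average and max pooling
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        weights = torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        return x * weights


class SpatialAttention(nn.Module):
    """Re-weights spatial positions using pooled spatial descriptors."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate channel-wise mean and max maps, then predict one weight per pixel
        pooled = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )
        return x * torch.sigmoid(self.conv(pooled))


if __name__ == "__main__":
    feats = torch.randn(4, 64, 28, 28)  # hypothetical CNN feature map
    feats = SpatialAttention()(ChannelAttention(64)(feats))

    # Standard triplet loss over sketch (anchor), matching photo (positive), and
    # non-matching photo (negative) embeddings; margin 0.3 is an assumed value.
    anchor, pos, neg = (F.normalize(torch.randn(4, 128), dim=1) for _ in range(3))
    loss = F.triplet_margin_loss(anchor, pos, neg, margin=0.3)
    print(feats.shape, loss.item())
```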
