TVT: Three-Way Vision Transformer through Multi-Modal Hypersphere Learning for Zero-Shot Sketch-Based Image Retrieval

Jialin Tian, Heng Tao Shen, Xing Xu, Yang Yang, Fumin Shen

doi:10.1609/aaai.v36i2.20136

Abstract

In this paper, we study the zero-shot sketch-based image retrieval (ZS-SBIR) task, which retrieves natural images related to sketch queries from unseen categories. In the literature, convolutional neural networks (CNNs) have become the de facto standard; they are either trained end-to-end or used to extract pre-trained features for images and sketches. However, CNNs are limited in modeling the global structural information of objects because convolution operations are intrinsically local. To address this, we propose a Transformer-based approach called Three-Way Vision Transformer (TVT), which leverages the ability of the Vision Transformer (ViT) to model global context through its global self-attention mechanism. Going beyond simply applying ViT to this task, we propose a token-based strategy that adds fusion and distillation tokens and makes them complementary to each other. Specifically, we integrate three ViTs, each pre-trained on data of one modality, into a three-way pipeline through distillation and multi-modal hypersphere learning. The distillation process supervises the fusion ViT (a ViT with an extra fusion token) with soft targets from the modality-specific ViTs, which protects the fusion ViT against catastrophic forgetting. Furthermore, our method learns a multi-modal hypersphere by performing inter- and intra-modal alignment without loss of uniformity, which aims to bridge the gap between the sketch and image modalities and to avoid dimensional collapse. Extensive experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, demonstrate the superiority of our TVT method over state-of-the-art ZS-SBIR methods.
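
The abstract names two training signals without giving formulas: alignment with uniformity on a hypersphere, and distillation from soft targets. The sketch below illustrates one common way such terms are written. It is not taken from the paper. It assumes the alignment and uniformity losses widely used in contrastive learning (Wang and Isola, 2020) and a temperature-softened KL divergence for distillation. All function names, the temperature values, and the pure-Python list representation are hypothetical choices made for illustration only.

```python
import math


def l2_normalize(v):
    """Project a vector onto the unit hypersphere (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(sketches, images):
    """Mean squared distance between paired sketch and image embeddings.

    Assumption: sketches[k] and images[k] belong to the same class, so a
    lower value means the two modalities lie closer together.
    """
    pairs = list(zip(sketches, images))
    return sum(sq_dist(s, i) for s, i in pairs) / len(pairs)


def uniformity_loss(embeddings, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the points are spread more evenly over the sphere;
    a fully collapsed set of identical points gives 0. Needs >= 2 points.
    """
    total, count = 0.0, 0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += math.exp(-t * sq_dist(embeddings[i], embeddings[j]))
            count += 1
    return math.log(total / count)


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a training loop these terms would typically be combined as a weighted sum over normalized embeddings from each modality; the weights and the exact pairing scheme are not stated in the abstract and would have to come from the full paper.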

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Jun 28, 2022
Citations: 16
