Learning From Expert: Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval

Lina Sun, Yewen Li, Yumin Dong

doi:10.1145/3591106.3592242

Abstract

Unsupervised cross-modal hashing (UCMH) has attracted increasing research attention because it retrieves efficiently and does not depend on labels. However, existing methods have two bottlenecks. First, they suffer from inaccurate similarity measures, because features of different modalities lack correlation and simple features cannot fully describe the fine-grained relationships in multi-modal data. Second, they have rarely explored vision-language knowledge distillation schemes that distil the multi-modal knowledge of vision-language models to guide the learning of student networks. To address these bottlenecks, this paper proposes an effective unsupervised cross-modal hashing retrieval method called Vision-Language Knowledge Distillation for Unsupervised Cross-Modal Hashing Retrieval (VLKD). VLKD uses a vision-language pre-training (VLP) model to encode features of multi-modal data and then constructs a similarity matrix that provides soft similarity supervision for the student model. In this way, it distils the knowledge of the VLP model into the student model, which thereby gains an understanding of multi-modal knowledge. In addition, we design an end-to-end unsupervised hashing learning model that incorporates a graph convolutional auxiliary network. The auxiliary network aggregates information from similar data nodes, based on the similarity matrix distilled from the teacher model, to generate more consistent hash codes. Finally, the teacher network requires no additional training; it only needs to guide the student network to learn high-quality hash representations, so VLKD is efficient in both training and retrieval. Extensive experiments on three multimedia retrieval benchmark datasets show that the proposed method achieves better retrieval performance than existing unsupervised cross-modal hashing methods, demonstrating its effectiveness.
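The pipeline summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact similarity construction, loss, or network design. The sketch assumes cosine similarity over teacher features, a hypothetical fusion weight `alpha`, a squared-error distillation loss, and a weight-free neighbour-averaging step standing in for the graph convolutional auxiliary network. All function names are illustrative.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = [l2_normalize(row) for row in a]
    b = [l2_normalize(row) for row in b]
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def teacher_similarity(img_feats, txt_feats, alpha=0.5):
    """Soft similarity matrix from frozen teacher (VLP) features.

    Assumption: the image-image and text-text similarities are fused
    with a hypothetical weight alpha; the paper may fuse them differently.
    """
    s_img = cosine_matrix(img_feats, img_feats)
    s_txt = cosine_matrix(txt_feats, txt_feats)
    n = len(img_feats)
    return [[alpha * s_img[i][j] + (1 - alpha) * s_txt[i][j]
             for j in range(n)] for i in range(n)]


def distillation_loss(img_codes, txt_codes, s_teacher):
    """Mean squared gap between student code similarity and teacher similarity.

    Assumption: a squared-error objective; only the student codes would
    receive gradients, since the teacher is not trained.
    """
    s_student = cosine_matrix(img_codes, txt_codes)
    n = len(s_teacher)
    total = sum((s_student[i][j] - s_teacher[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)


def aggregate_neighbors(s_teacher, feats):
    """One weight-free propagation step over the similarity graph.

    Assumption: negative similarities are clipped to zero and each row is
    normalised, a simplified stand-in for a graph convolutional layer.
    """
    out = []
    for row in s_teacher:
        w = [max(x, 0.0) for x in row]
        z = sum(w)
        w = [x / z for x in w] if z > 0 else w
        dim = len(feats[0])
        out.append([sum(w[j] * feats[j][d] for j in range(len(feats)))
                    for d in range(dim)])
    return out


def binarize(codes):
    """Map real-valued student outputs to hash codes in {-1, +1}."""
    return [[1 if x >= 0 else -1 for x in row] for row in codes]
```

In such a setup, `teacher_similarity` would be computed once per batch from the frozen teacher's features, and `distillation_loss` would then be minimised with respect to the student's continuous outputs before `binarize` is applied at retrieval time.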

Full Text