Abstract

Recently, the accuracy of image and sentence matching has been continuously improved by larger and larger models. However, such large models not only require huge storage space but also slow down inference, which makes them poorly suited to low-cost devices in real-world applications. To our knowledge, this work makes the first attempt to improve model efficiency in the context of image and sentence matching, and accordingly proposes a simple yet effective Whitened Similarity Distillation (WSD) method, which distills cross-modal knowledge from a large teacher model into a small student model with both high efficiency and accuracy. The high efficiency is achieved by performing: 1) feature representation based on efficient backbone networks; and 2) similarity measurement in a fast N-to-N manner. However, the accuracy of such a student model is much worse than that of the teacher model, because there is a very large variation inconsistency between the cross-modal similarity matrices of the teacher and student models, which is hard to reduce during similarity distillation. By performing two whitening-like transformations in the orthogonal space, the proposed WSD reduces this large variation inconsistency more isotropically and thus improves the accuracy of the student model. We perform extensive experiments on two benchmark datasets and demonstrate the effectiveness of the proposed WSD. Compared with the teacher model, our distilled student model is 7× smaller (in model size) and 9× faster (in testing speed), at the cost of only a 2% accuracy decrease.
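
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the general idea: compute the N-to-N cross-modal similarity matrices of the student and the teacher, apply a whitening-like transformation to each in an orthogonal (eigenvector) basis, and penalize their discrepancy. The function names (`whiten`, `wsd_loss`), the ZCA-style whitening, and the MSE objective are illustrative assumptions; the paper's exact transformations and loss may differ.

```python
import torch
import torch.nn.functional as F


def whiten(sim, eps=1e-5):
    """Whitening-like transform of an N-by-N similarity matrix.

    Centers the matrix and decorrelates it in the orthogonal
    (eigenvector) space so that all directions have comparable
    variance, reducing the teacher/student variation inconsistency
    more isotropically. (Illustrative ZCA-style whitening.)
    """
    s = sim - sim.mean(dim=0, keepdim=True)            # center columns
    cov = s.t() @ s / (s.shape[0] - 1)                 # N x N covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)          # orthogonal basis
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.t()
    return s @ inv_sqrt                                # decorrelated, unit-variance


def wsd_loss(img_s, txt_s, img_t, txt_t):
    """Distillation loss between whitened student/teacher similarity matrices.

    img_s/txt_s: student image/sentence embeddings, shape (N, d_s).
    img_t/txt_t: teacher image/sentence embeddings, shape (N, d_t).
    """
    sim_s = F.normalize(img_s, dim=-1) @ F.normalize(txt_s, dim=-1).t()  # student N x N
    sim_t = F.normalize(img_t, dim=-1) @ F.normalize(txt_t, dim=-1).t()  # teacher N x N
    return F.mse_loss(whiten(sim_s), whiten(sim_t.detach()))
```

In a training loop, `img_s`/`txt_s` would come from the small student encoders and `img_t`/`txt_t` from the frozen teacher, with `wsd_loss` added to the standard matching objective; the teacher similarities are detached so that gradients only update the student.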
