In cross-modal retrieval, single-encoder models tend to outperform dual-encoder models, but at the cost of high latency and low throughput. In this paper, we propose a dual-encoder model called BagFormer that uses a bag-wise late-interaction mechanism to improve re-ranking performance without sacrificing latency or throughput. BagFormer achieves this with a bagging layer that transforms text into an appropriate granularity, which both mitigates the modal granularity mismatch and enables the integration of entity knowledge into the model. Our experiments show that BagFormer (ViT-B) outperforms the traditional dual-encoder model CLIP (ViT-B) by 7.97% in the zero-shot setting, and by an even larger margin of 17.98% after fine-tuning. Moreover, BagFormer matches the performance of state-of-the-art single-encoder models on cross-modal retrieval tasks while offering more efficient inference with lower latency and higher throughput: compared to single-encoder models, it achieves a 38.14× speedup when re-ranking individual candidates. Code and models are available at github.com/howard-hou/BagFormer.
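To make the mechanism concrete, the sketch below shows one plausible reading of bag-wise late interaction: token embeddings are mean-pooled into bags (e.g., entity spans) by a bagging layer, and each bag is then matched against its best-aligned image patch, ColBERT-style. This is a minimal illustration under assumed shapes and pooling choices; the function names, the mean-pooling aggregation, and the MaxSim scoring are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def bag_wise_late_interaction(text_tokens, bag_ids, image_patches):
    """Hypothetical sketch of bag-wise late-interaction scoring.

    text_tokens:   (num_tokens, dim)  token embeddings from the text encoder
    bag_ids:       (num_tokens,)      bag assignment per token (e.g., entity spans)
    image_patches: (num_patches, dim) patch embeddings from the image encoder
    """
    # Bagging layer (assumed mean pooling): aggregate token embeddings within
    # each bag so text granularity better matches image-patch granularity.
    num_bags = int(bag_ids.max()) + 1
    bags = torch.zeros(num_bags, text_tokens.size(1))
    bags.index_add_(0, bag_ids, text_tokens)
    counts = torch.bincount(bag_ids, minlength=num_bags).clamp(min=1)
    bags = bags / counts.unsqueeze(1)

    # Late interaction: each bag is scored against its best-matching image
    # patch (ColBERT-style MaxSim), and bag scores are summed.
    bags = F.normalize(bags, dim=-1)
    patches = F.normalize(image_patches, dim=-1)
    sim = bags @ patches.T                  # (num_bags, num_patches)
    return sim.max(dim=1).values.sum()      # scalar relevance score

# Toy usage: 5 tokens grouped into 2 bags, scored against 9 image patches.
score = bag_wise_late_interaction(
    torch.randn(5, 64), torch.tensor([0, 0, 1, 1, 1]), torch.randn(9, 64)
)
```

Because the bag embeddings can be precomputed independently per modality, this interaction runs only over the small bag-by-patch similarity matrix at re-ranking time, which is consistent with the latency and throughput gains reported over single-encoder re-rankers.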