Abstract

The effective representation of video features plays an important role in video-text cross-modal retrieval, yet many researchers either use a single video modality or simply concatenate multiple video modalities, which makes the learned video features less robust. To enhance the robustness of the video feature representation, we use a coarse-fine-grained parallel attention model together with a feature fusion module to learn a more effective video representation. Coarse-grained attention learns the relationships between different feature blocks within the same modality, while fine-grained attention operates on the global features and strengthens the connections between individual points; the two branches complement each other. We integrate a multi-head attention network into the model to expand the receptive field of the features, and use the feature fusion module to further reduce the semantic gap between different video modalities. The proposed architecture not only strengthens the relationship between global and local features, but also compensates for the differences between the video's modality features. Evaluation on three widely used datasets, ActivityNet-Captions, MSR-VTT, and LSMDC, demonstrates its effectiveness.
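
To make the described architecture more concrete, the sketch below shows one plausible way to wire a parallel coarse/fine-grained attention block and a feature fusion step in PyTorch. This is not the authors' implementation: the module names (CoarseFineAttention, FeatureFusion), the block-pooling strategy for the coarse branch, and the concatenation-plus-linear fusion are assumptions chosen for illustration.

```python
# Minimal sketch (assumed design, not the paper's code) of parallel
# coarse/fine-grained multi-head attention plus a feature fusion module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseFineAttention(nn.Module):
    """Applies multi-head self-attention at two granularities in parallel."""

    def __init__(self, dim: int, num_heads: int = 8, block_size: int = 4):
        super().__init__()
        self.block_size = block_size
        # Coarse-grained: attention among pooled feature blocks of one modality.
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fine-grained: attention over every individual (frame-level) feature.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) features of a single video modality.
        b, t, d = x.shape
        # Pool consecutive frames into coarse blocks, then attend among blocks.
        n_blocks = t // self.block_size
        blocks = (
            x[:, : n_blocks * self.block_size]
            .reshape(b, n_blocks, self.block_size, d)
            .mean(dim=2)
        )
        coarse, _ = self.coarse_attn(blocks, blocks, blocks)
        # Broadcast block-level context back to frame level and pad the tail.
        coarse = coarse.repeat_interleave(self.block_size, dim=1)
        coarse = F.pad(coarse, (0, 0, 0, t - coarse.size(1)))
        # Fine-grained attention over all frame positions.
        fine, _ = self.fine_attn(x, x, x)
        # The two branches complement each other; combine and normalise.
        return self.norm(x + coarse + fine)


class FeatureFusion(nn.Module):
    """Fuses per-modality video features to reduce the cross-modal gap."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.proj = nn.Linear(num_modalities * dim, dim)

    def forward(self, features: list) -> torch.Tensor:
        # Each element: (batch, seq_len, dim); concatenate along channels.
        return self.proj(torch.cat(features, dim=-1))


if __name__ == "__main__":
    dim, modalities = 256, 2  # e.g. appearance and motion features
    attn = CoarseFineAttention(dim)
    fusion = FeatureFusion(dim, modalities)
    feats = [torch.randn(2, 16, dim) for _ in range(modalities)]
    fused = fusion([attn(f) for f in feats])
    print(fused.shape)  # torch.Size([2, 16, 256])
```

The fused representation would then be matched against the text features; the retrieval loss itself is outside the scope of this sketch.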
