To address the limited capacity of the CLIP text encoder and the insufficient interaction between CLIP's two encoder towers, a CLIP-based video description model, RAMSG, is proposed, which combines retrieval augmentation with multi-scale semantic guidance. First, RAMSG uses the visual and text encoder modules of CLIP to perform cross-modal retrieval and extract relevant text as a supervisory signal. Then, a semantic detector and ranker recover the intrinsic ordering of the semantic words. Finally, the global and local guidance provided by the multi-scale semantic guidance module improves the quality of the descriptions generated by the decoder module. Experimental results on the video description datasets MSR-VTT and VATEX show that RAMSG achieves significant improvements over prior work on several performance metrics, and that the additional textual semantics obtained through the video-text matching task substantially improve model performance.
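A minimal sketch may help make the three stages in the abstract concrete: cross-modal retrieval with CLIP-style embeddings, semantic word detection and ranking, and multi-scale (global plus local) guidance for the decoder. All module names, dimensions, the fusion strategy, and the random tensors standing in for real CLIP features below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the RAMSG pipeline; components and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalAugmenter(nn.Module):
    """Retrieve the top-k corpus captions whose text embeddings are most
    similar to the video embedding (cross-modal retrieval)."""
    def __init__(self, k: int = 5):
        super().__init__()
        self.k = k

    def forward(self, video_emb, corpus_text_embs):
        # Cosine similarity between one video and every corpus caption.
        sims = F.cosine_similarity(video_emb.unsqueeze(0), corpus_text_embs, dim=-1)
        topk = sims.topk(self.k)
        return topk.indices, topk.values  # ids and scores of retrieved captions

class SemanticDetectorRanker(nn.Module):
    """Score each vocabulary word against the video embedding, then rank
    the words to recover an ordering of detected semantic words."""
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.scorer = nn.Linear(dim, vocab_size)

    def forward(self, video_emb, top_n: int = 10):
        scores = torch.sigmoid(self.scorer(video_emb))
        ranked = scores.argsort(descending=True)[:top_n]
        return ranked, scores

class MultiScaleGuidance(nn.Module):
    """Fuse a global (video-level) and local (word-level) semantic signal
    into one guidance vector to condition the caption decoder."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, global_sem, local_sem):
        # local_sem: (n_words, dim) -> pool to one vector, then fuse with global.
        local_pooled = local_sem.mean(dim=0)
        return self.fuse(torch.cat([global_sem, local_pooled], dim=-1))

if __name__ == "__main__":
    dim, vocab = 512, 1000
    video_emb = torch.randn(dim)         # stand-in for a CLIP video feature
    corpus = torch.randn(200, dim)       # stand-in for CLIP caption features
    word_embs = torch.randn(vocab, dim)  # stand-in word embedding table

    idx, _ = RetrievalAugmenter(k=5)(video_emb, corpus)
    ranked, _ = SemanticDetectorRanker(dim, vocab)(video_emb)
    guidance = MultiScaleGuidance(dim)(video_emb, word_embs[ranked])
    print(idx.shape, ranked.shape, guidance.shape)
```

In a real system, `video_emb` and `corpus` would come from frozen CLIP encoders, the retrieved captions would supervise the semantic detector, and `guidance` would condition a transformer or LSTM decoder at each step.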