Abstract

Much current research on video-text cross-modal retrieval focuses on narrowing the semantic gap between video and text, but ignores the semantic differences among the sampled frames of the same video and the correlation between the feature distributions of the objects those frames contain. As a result, the learned frame features cannot adequately represent the semantics of the whole video. To overcome these shortcomings, we first use a pre-trained video frame classification-aggregation network to bring the object categories contained in different sampled frames of the same video closer to the important object categories of the whole video, which promotes consistent feature distributions across the sampled frames and increases the relevance of object features across frames. We then propose a video internal frame aggregation loss module to resolve the inconsistency between the individual frame features produced by the video encoder and the aggregated feature of the sampled frames, thereby strengthening the representational power of the aggregated frame feature. Experiments on three common datasets, MSVD, MSR-VTT, and DiDeMo, demonstrate the effectiveness of the proposed approach.
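To make the second contribution concrete, the sketch below shows one plausible form of an intra-video frame aggregation loss: each sampled frame's feature is pulled toward the aggregated feature of its own video via cosine similarity. This is a minimal illustration under assumed tensor shapes, not the authors' exact formulation; the function name and shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def frame_aggregation_loss(frame_feats: torch.Tensor,
                           agg_feat: torch.Tensor) -> torch.Tensor:
    """Hypothetical intra-video frame aggregation loss (a sketch).

    frame_feats: (B, N, D) -- features of N sampled frames per video
    agg_feat:    (B, D)    -- aggregated feature of each whole video
    """
    frame_feats = F.normalize(frame_feats, dim=-1)
    agg_feat = F.normalize(agg_feat, dim=-1)
    # Cosine similarity between every frame and its video's aggregate.
    sim = torch.einsum('bnd,bd->bn', frame_feats, agg_feat)
    # Minimizing (1 - cos) pushes frame features toward the aggregate,
    # encouraging consistent feature distributions within a video.
    return (1.0 - sim).mean()
```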
