Abstract

The effective representation of video features plays an important role in video-text cross-modal retrieval, yet many researchers either use a single video modality or simply concatenate multiple video modalities, which makes the learned video features less robust. To enhance the robustness of the video feature representation, we use a coarse-fine-grained parallel attention model together with a feature fusion module to learn a more effective video representation. Coarse-grained attention learns the relationships between different feature blocks within the same modality, while fine-grained attention operates on the global features and strengthens the connections between individual points; the two branches complement each other. We integrate a multi-head attention network into the model to expand the receptive field of the features, and use the feature fusion module to further reduce the semantic gap between different video modalities. The proposed architecture not only strengthens the relationship between global and local features, but also compensates for the differences between the video's modality features. Evaluation on three widely used datasets, ActivityNet-Captions, MSR-VTT, and LSMDC, demonstrates its effectiveness.
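
To make the described architecture more concrete, the sketch below shows one plausible way to wire a parallel coarse/fine-grained attention block and a feature fusion step in PyTorch. This is not the authors' implementation: the module names (CoarseFineAttention, FeatureFusion), the block-pooling strategy for the coarse branch, and the concatenation-plus-linear fusion are assumptions chosen for illustration.

```python
# Minimal sketch (assumed design, not the paper's code) of parallel
# coarse/fine-grained multi-head attention plus a feature fusion module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoarseFineAttention(nn.Module):
    """Applies multi-head self-attention at two granularities in parallel."""

    def __init__(self, dim: int, num_heads: int = 8, block_size: int = 4):
        super().__init__()
        self.block_size = block_size
        # Coarse-grained: attention among pooled feature blocks of one modality.
        self.coarse_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Fine-grained: attention over every individual (frame-level) feature.
        self.fine_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) features of a single video modality.
        b, t, d = x.shape
        # Pool consecutive frames into coarse blocks, then attend among blocks.
        n_blocks = t // self.block_size
        blocks = (
            x[:, : n_blocks * self.block_size]
            .reshape(b, n_blocks, self.block_size, d)
            .mean(dim=2)
        )
        coarse, _ = self.coarse_attn(blocks, blocks, blocks)
        # Broadcast block-level context back to frame level and pad the tail.
        coarse = coarse.repeat_interleave(self.block_size, dim=1)
        coarse = F.pad(coarse, (0, 0, 0, t - coarse.size(1)))
        # Fine-grained attention over all frame positions.
        fine, _ = self.fine_attn(x, x, x)
        # The two branches complement each other; combine and normalise.
        return self.norm(x + coarse + fine)


class FeatureFusion(nn.Module):
    """Fuses per-modality video features to reduce the cross-modal gap."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.proj = nn.Linear(num_modalities * dim, dim)

    def forward(self, features: list) -> torch.Tensor:
        # Each element: (batch, seq_len, dim); concatenate along channels.
        return self.proj(torch.cat(features, dim=-1))


if __name__ == "__main__":
    dim, modalities = 256, 2  # e.g. appearance and motion features
    attn = CoarseFineAttention(dim)
    fusion = FeatureFusion(dim, modalities)
    feats = [torch.randn(2, 16, dim) for _ in range(modalities)]
    fused = fusion([attn(f) for f in feats])
    print(fused.shape)  # torch.Size([2, 16, 256])
```

The fused representation would then be matched against the text features; the retrieval loss itself is outside the scope of this sketch.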
