Abstract

Understanding videos, and in particular aligning them with textual data, remains a significant challenge in computer vision. The advent of vision-language models (VLMs) such as CLIP has sparked interest in leveraging their capabilities for video understanding, yielding marked advances in both performance and efficiency. However, current methods often neglect valuable user-generated metadata such as video titles. In this paper, we present Cap4Video++, a universal framework that leverages auxiliary captions to enrich video understanding. Harnessing the synergy of VLMs and the recently flourishing large language models (LLMs) such as ChatGPT, Cap4Video++ generates video captions and exploits them in three key stages: (i) at the input stage, Semantic Pair Sampling extracts beneficial samples from the captions to aid contrastive learning; (ii) at the intermediate stage, Video-Caption Cross-modal Interaction and Adaptive Caption Selection jointly strengthen video and caption representations; (iii) at the output stage, a complementary Caption-Text Matching branch augments the primary video branch by improving similarity calculation. Comprehensive experiments on text-video retrieval and video action recognition across nine benchmarks demonstrate Cap4Video++'s superiority over existing models, highlighting the effectiveness of automatically generated captions for advancing video understanding.
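To make the input-stage idea concrete, the following is a minimal sketch of what a Semantic Pair Sampling step could look like: given several auto-generated captions for a video, captions sufficiently similar to the query embedding are treated as extra positives for contrastive learning, and the rest as negatives. The function names, embeddings, and threshold here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Semantic Pair Sampling (stage i).
# All names and the threshold value are illustrative assumptions.

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def sample_semantic_pairs(query_emb, caption_embs, pos_thresh=0.5):
    """Split caption indices into positives and negatives by similarity."""
    sims = [cosine(query_emb, c) for c in caption_embs]
    positives = [i for i, s in enumerate(sims) if s >= pos_thresh]
    negatives = [i for i, s in enumerate(sims) if s < pos_thresh]
    return positives, negatives

# Toy example: captions 0 and 2 align with the query, caption 1 does not.
pos, neg = sample_semantic_pairs(
    query_emb=[1.0, 0.0],
    caption_embs=[[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]],
)
```

In a real pipeline the embeddings would come from the VLM's text encoder, and the sampled positives and negatives would feed a contrastive loss such as InfoNCE.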
