Abstract
This paper proposes a workflow for efficient interaction with long-form video: given a natural-language query, the system automatically identifies and returns the timestamps of the requested actions or events, eliminating the need for manual browsing. A Hierarchical Query Processor decomposes each user request into temporally dependent sub-queries, while a Timestamp-Aware Frame Encoder associates visual frames with precise timestamps, so that video content is modeled for time-sensitive retrieval. Three further components are integrated to optimize performance: a Sliding Video Q-Former that captures temporal relationships across frames, a Temporal Attention Cache that reuses pre-computed attention patterns, and a Language Model that interprets queries and generates precise timestamped responses. The approach is particularly valuable for instructional media, surveillance analysis, and content search, where time-sensitive accuracy and contextual understanding are crucial.

Keywords—Video Comprehension; Temporal Localization; Hierarchical Query Processing; Timestamp Embedding; Attention Caching; Large Language Models (LLMs)
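To make the pipeline concrete, the sketch below illustrates the two front-end ideas in the abstract: hierarchically decomposing a compound query into temporally ordered sub-queries, then localizing each sub-query against timestamped frame representations. All names (`SubQuery`, `Frame`, `decompose_query`, `locate`, `answer`) are hypothetical, and the string-matching localizer is a toy stand-in for the paper's learned encoder and Q-Former; it is a minimal sketch, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SubQuery:
    text: str
    order: int  # temporal rank within the original request

def decompose_query(query: str) -> List[SubQuery]:
    # Toy hierarchical decomposition: split a compound request on the
    # connective "then" into temporally ordered sub-queries (stand-in
    # for an LLM-based Hierarchical Query Processor).
    parts = [p.strip() for p in query.lower().split(" then ") if p.strip()]
    return [SubQuery(text=p, order=i) for i, p in enumerate(parts)]

@dataclass
class Frame:
    timestamp: float  # seconds from video start
    caption: str      # stand-in for a timestamp-aware visual embedding

def locate(sub: SubQuery, frames: List[Frame],
           not_before: float = 0.0) -> Optional[float]:
    # Return the first timestamp at or after `not_before` whose frame
    # matches the sub-query; the bound enforces the temporal dependency
    # between consecutive sub-queries.
    for f in frames:
        if f.timestamp >= not_before and sub.text in f.caption:
            return f.timestamp
    return None

def answer(query: str, frames: List[Frame]) -> List[float]:
    # End-to-end sketch: decompose the request, then localize each
    # sub-query in order, carrying the previous match forward.
    stamps, cursor = [], 0.0
    for sub in decompose_query(query):
        t = locate(sub, frames, not_before=cursor)
        if t is None:
            break
        stamps.append(t)
        cursor = t
    return stamps
```

For example, on frames captioned "open the lid" (0 s), "pour water" (4 s), and "close the lid" (8 s), the query "open the lid then pour water" yields the timestamps `[0.0, 4.0]`, with the second match constrained to occur after the first.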