Temporal answer grounding in instructional video (TAGV) is a new task naturally derived from temporal sentence grounding in general video (TSGV). Given an untrimmed instructional video and a text question, the task aims to locate the frame span in the video that semantically answers the question, i.e., the visual answer. Existing methods tend to solve the TAGV problem with a visual span-based predictor, using visual information to predict the start and end frames in the video. However, because the semantic features of the textual question and the visual answer are only weakly correlated, methods relying on a visual span-based predictor perform poorly on the TAGV task. In this paper, we propose a visual-prompt text span localization (VPTSL) method, which introduces timestamped subtitles for a text span-based predictor. Specifically, the visual prompt is a learnable feature embedding that injects visual knowledge into the pre-trained language model, while the text span-based predictor learns joint semantic representations of the input text question, video subtitles, and visual prompt features with the pre-trained language model. TAGV is thus reformulated as visual-prompt subtitle span localization for the visual answer. Extensive experiments on five instructional video datasets, namely MedVidQA, TutorialVQA, VehicleVQA, CrossTalk, and Coin, show that the proposed method outperforms several state-of-the-art (SOTA) methods by a large margin in terms of mIoU score, demonstrating the effectiveness of the proposed visual prompt and text span-based predictor.
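To make the architecture described above concrete, the sketch below illustrates the general idea of a visual-prompt text span predictor: video features are projected into learnable prompt embeddings, prepended to the question-plus-subtitle token embeddings, encoded jointly, and a span head predicts start/end positions over the subtitle tokens (which can then be mapped back to video timestamps). This is a minimal illustration only; the toy Transformer encoder stands in for the pre-trained language model, and all module names, dimensions, and the number of prompt tokens are assumptions rather than the authors' implementation.

```python
# Minimal sketch of the visual-prompt text span localization idea (not the
# authors' code): a learnable projection turns video features into prompt
# embeddings that are prepended to the question+subtitle token embeddings;
# a span head then predicts start/end logits over the text positions.
import torch
import torch.nn as nn


class VisualPromptSpanLocalizer(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, num_prompts=8, video_dim=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Visual prompt: clip-level video features projected into the text embedding space.
        self.visual_proj = nn.Linear(video_dim, hidden)
        self.num_prompts = num_prompts
        # Placeholder for the pre-trained language model (a real system would
        # load a BERT/DeBERTa-style encoder here).
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Span head: start/end logits for every text position.
        self.span_head = nn.Linear(hidden, 2)

    def forward(self, text_ids, video_feats):
        # text_ids:    (B, L) token ids for "[CLS] question [SEP] subtitles [SEP]"
        # video_feats: (B, num_prompts, video_dim) pooled video features
        text_emb = self.token_emb(text_ids)                   # (B, L, H)
        prompt = self.visual_proj(video_feats)                # (B, P, H)
        joint = torch.cat([prompt, text_emb], dim=1)          # (B, P+L, H)
        enc = self.encoder(joint)                             # joint text-visual encoding
        logits = self.span_head(enc[:, self.num_prompts:])    # drop prompt positions
        start_logits, end_logits = logits.unbind(dim=-1)      # (B, L) each
        return start_logits, end_logits


if __name__ == "__main__":
    model = VisualPromptSpanLocalizer()
    ids = torch.randint(0, 30522, (2, 128))   # toy question+subtitle tokens
    vid = torch.randn(2, 8, 1024)             # toy clip-level video features
    s, e = model(ids, vid)
    print(s.shape, e.shape)                   # torch.Size([2, 128]) twice
```

Training would minimize cross-entropy over the start and end positions of the gold subtitle span, and at inference the highest-scoring span is converted to a frame interval via the subtitle timestamps.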