Abstract

In the field of video captioning, acquiring large amounts of high-quality aligned video-text pairs remains laborious, impeding practical applications. We therefore explore modelling techniques for unsupervised video captioning. Generating captions from text inputs that resemble the video representation has previously been a successful strategy for unsupervised video captioning. However, this setting relies solely on textual data for training and neglects vital visual cues related to the spatio-temporal appearance of the video. The absence of visual information increases the risk of generating erroneous video captions. In view of this, we propose a novel unsupervised video captioning method that introduces visual information related to textual keywords to implicitly enhance training of the text generation task. Simultaneously, our method incorporates the keywords into the sentence to explicitly augment the training process. In this way, our method injects additional implicit visual features and explicit keywords into the model, endowing the generated captions with more accurate semantics. The experimental analysis demonstrates the merit of the proposed formulation, achieving superior performance against state-of-the-art unsupervised studies.
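To make the core idea concrete, the following is a minimal sketch, not the authors' implementation, of how implicit visual features and explicit keywords could be injected into a text-only captioning objective. All names here (KeywordVisualMemory, CaptionDecoder) are hypothetical, and the architecture choices (a soft attention memory, a small Transformer) are assumptions for illustration only.

```python
# Hypothetical sketch: inject (a) implicit visual features retrieved for keywords
# and (b) explicit keyword tokens into a text-only captioning model.
import torch
import torch.nn as nn


class KeywordVisualMemory(nn.Module):
    """Hypothetical memory bank mapping keyword embeddings to pseudo-visual features."""

    def __init__(self, num_slots: int = 512, dim: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim))    # keyword-space keys
        self.values = nn.Parameter(torch.randn(num_slots, dim))  # visual-space values

    def forward(self, keyword_emb: torch.Tensor) -> torch.Tensor:
        # Soft retrieval: attention over memory slots yields an implicit visual feature.
        attn = torch.softmax(keyword_emb @ self.keys.T / keyword_emb.size(-1) ** 0.5, dim=-1)
        return attn @ self.values


class CaptionDecoder(nn.Module):
    """Hypothetical captioner conditioned on a visual prefix plus explicit keyword tokens.
    A real system would use a causal decoder; a plain Transformer encoder is used
    here only to keep the sketch short."""

    def __init__(self, vocab_size: int = 10000, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, visual_prefix, keyword_ids, caption_ids):
        # Explicit augmentation: keyword tokens are prepended to the caption tokens.
        tokens = self.embed(torch.cat([keyword_ids, caption_ids], dim=1))
        # Implicit augmentation: the retrieved visual feature is injected as a prefix embedding.
        x = torch.cat([visual_prefix.unsqueeze(1), tokens], dim=1)
        return self.lm_head(self.backbone(x))


# Toy usage: produce token logits from text-only data (captions plus mined keywords).
memory, decoder = KeywordVisualMemory(), CaptionDecoder()
keyword_ids = torch.randint(0, 10000, (2, 3))   # keywords mined from the caption
caption_ids = torch.randint(0, 10000, (2, 12))  # caption tokens
keyword_emb = decoder.embed(keyword_ids).mean(dim=1)
visual_prefix = memory(keyword_emb)
logits = decoder(visual_prefix, keyword_ids, caption_ids)
print(logits.shape)  # (2, 1 + 3 + 12, vocab_size)
```

The sketch only shows the conditioning path; the actual training loss, keyword mining, and visual feature source described in the paper are not reproduced here.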
