Abstract

Existing video captioning methods usually ignore fine-grained semantic attributes, video diversity, and the associations and motion states of objects within and across frames, so they adapt poorly to small-sample datasets. To address these problems, this paper proposes a novel video captioning model together with an adversarial reinforcement learning strategy. First, an object-scene relational graph model is built from the outputs of an object detector and a scene segmenter to express association features; the graph is encoded by a graph neural network to enrich the visual representation. Meanwhile, a trajectory-based feature representation model is designed to replace the previous purely data-driven approach to extracting motion and attribute information, allowing object motion to be analyzed in the temporal domain and a connection between visual content and language to be established even on small datasets. Finally, an adversarial reinforcement learning strategy with a multi-branch discriminator is designed to learn the relationship between visual content and the corresponding words, so that rich linguistic knowledge is integrated into the model. Experimental results on three standard datasets and one small-sample dataset show that the proposed method achieves state-of-the-art performance, and ablation studies and visualization results verify the effectiveness of each proposed strategy.
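To make the object-scene graph encoding step concrete, the sketch below shows one plausible way such a relational graph could be encoded with a single graph-convolution step in PyTorch. The class name `RelationalGraphEncoder`, the feature dimensions, and the object-to-scene adjacency pattern are illustrative assumptions and are not taken from the paper's actual architecture.

```python
import torch
import torch.nn as nn

class RelationalGraphEncoder(nn.Module):
    """Minimal sketch: object and scene nodes are connected by an adjacency
    matrix and refined with one graph-convolution (message-passing) step."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_dim)    # shared node projection
        self.update = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, feat_dim) detector/segmenter features for N nodes
        # adj:        (N, N) binary relations (e.g. object-in-scene, co-occurrence)
        adj = adj + torch.eye(adj.size(0))             # add self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        adj_norm = adj / deg                           # row-normalized adjacency
        h = torch.relu(self.proj(node_feats))
        h = torch.relu(self.update(adj_norm @ h))      # aggregate neighbor messages
        return h                                       # enriched node features

# Hypothetical usage: 5 object nodes plus 1 scene node with 2048-d features.
feats = torch.randn(6, 2048)
adj = torch.zeros(6, 6)
adj[:5, 5] = adj[5, :5] = 1.0                          # link every object to the scene
encoder = RelationalGraphEncoder(feat_dim=2048, hidden_dim=512)
enriched = encoder(feats, adj)                         # (6, 512) enriched features
```

In this toy setup the enriched node features would then feed the caption decoder; the full model in the paper additionally incorporates trajectory-based motion features and adversarial reinforcement learning, which are not shown here.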
