Abstract

Traditional video captioning produces a holistic description of a video, so detailed descriptions of specific objects are often unavailable. Moreover, most methods are trained on frame-level features in which objects are entangled, paired with ambiguous descriptions, which makes it difficult to learn vision-language relationships. Without identifying the classes and locations of individual objects, or associating the transition trajectories among them, these data-driven, image-based video captioning methods cannot reason about activities from visual features alone, let alone perform well with small samples. We propose a novel task, object-oriented video captioning, which focuses on understanding videos at the object level. We re-annotate an object-oriented video captioning dataset (Object-Oriented Captions) with object-sentence pairs to facilitate more effective cross-modal learning. We then design a video-based structured trajectory network trained via adversarial learning (STraNet) to analyze activities along the time domain and to capture vision-language connections on small datasets. The proposed STraNet consists of four components: the structured trajectory representation, the attribute explorer, the attribute-enhanced caption generator, and the adversarial discriminator. The high-level structured trajectory representation supplements previous image-based approaches, allowing activities to be reasoned about from the temporal evolution of visual features and the dynamic movement of spatial locations. The attribute explorer captures discriminative features among different objects, with which the subsequent caption generator can produce more informative and accurate descriptions. Finally, adding an adversarial discriminator to the caption generation task improves learning of the inter-relationships between visual contents and the corresponding visual words. To demonstrate its effectiveness, we evaluate the proposed method on the new dataset and compare it with state-of-the-art video captioning methods. The experimental results show that STraNet can precisely describe concurrent objects and their activities in detail.
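To make the four-component design concrete, below is a minimal sketch of how such an architecture could be wired up in PyTorch. All module names, feature dimensions, and interfaces here are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the four STraNet-style components described above.
# All names, dimensions, and interfaces are assumptions for illustration only.
import torch
import torch.nn as nn


class StructuredTrajectoryEncoder(nn.Module):
    """Encodes one object's appearance features and box locations over time."""
    def __init__(self, feat_dim=2048, loc_dim=4, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim + loc_dim, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, obj_feats, obj_boxes):
        # obj_feats: (B, T, feat_dim) appearance features of one object track
        # obj_boxes: (B, T, loc_dim) normalized box coordinates of the track
        x = self.proj(torch.cat([obj_feats, obj_boxes], dim=-1))
        outputs, last = self.rnn(x)
        return outputs, last.squeeze(0)  # per-step states, trajectory summary


class AttributeExplorer(nn.Module):
    """Predicts discriminative attributes (e.g. class, color) for an object."""
    def __init__(self, hidden_dim=512, num_attrs=300):
        super().__init__()
        self.head = nn.Linear(hidden_dim, num_attrs)

    def forward(self, traj_summary):
        return torch.sigmoid(self.head(traj_summary))  # multi-label attribute scores


class AttributeEnhancedCaptioner(nn.Module):
    """Decodes an object-level sentence from the trajectory summary and attributes."""
    def __init__(self, vocab_size=10000, hidden_dim=512, num_attrs=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.fuse = nn.Linear(hidden_dim + num_attrs, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, traj_summary, attr_scores, captions):
        # captions: (B, L) ground-truth token ids used for teacher forcing
        h = self.fuse(torch.cat([traj_summary, attr_scores], dim=-1))
        logits = []
        for t in range(captions.size(1)):
            h = self.rnn(self.embed(captions[:, t]), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, L, vocab_size)


class CaptionDiscriminator(nn.Module):
    """Scores whether a (trajectory, sentence) pair looks real, for adversarial training."""
    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim * 2, 1)

    def forward(self, traj_summary, tokens):
        _, h = self.rnn(self.embed(tokens))
        return torch.sigmoid(self.score(torch.cat([h.squeeze(0), traj_summary], dim=-1)))
```

In this sketch each object track is encoded separately so that a sentence can be generated per object, which is the key difference from frame-level captioning.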

Highlights

  • With the rapid growth of video on the Internet, it is increasingly important to understand videos automatically

  • We propose a novel object-oriented video captioning task, which transforms traditional frame-level video captioning into object-level analysis

  • Unlike most previous image-based methods, our object-oriented video captioning framework enables more realistic video understanding and can analyze concurrent activities in detail



Introduction

With the rapid growth of video on the Internet, it is increasingly important to understand videos automatically. Video captioning, which calls for a systematic description of a video, poses the intriguing challenge of learning the connections between vision and natural language [1]–[8], [10], [12]–[18], [28]–[34]. As image captioning methods have gradually matured [8]–[11], more and more attention has shifted to video captioning [12]–[18]. Video captioning is significant for human-robot interaction and for visually impaired people [6], [7]. Conventional video captioning offers a holistic description of the entire video with little detail about individual objects, and it cannot provide information about concurrent objects and activities either. When humans watch videos, instead of focusing on the
