Abstract

Video understanding has become increasingly important as surveillance, social, and informational videos weave themselves into our everyday lives. Video captioning offers a way to summarize, index, and search this data. Most captioning models pair a video encoder with a caption decoder. Hierarchical encoders can abstractly capture clip-level temporal features to represent a video, but the clips fall at fixed time steps. This paper introduces a novel Multistream Hierarchical Boundary (MHB) model that combines a fixed-hierarchy recurrent architecture with a soft hierarchy layer, using intrinsic feature boundary cuts within a video to define clips. A novel parametric Gaussian attention allows handling of variable-length videos. The intrinsic properties of videos are exploited to form an adaptive hierarchical video representation. The model is trained end-to-end for video captioning and demonstrates state-of-the-art captioning results on recent datasets.
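The idea behind a parametric Gaussian attention can be pictured with a minimal sketch (not the authors' implementation; all names and the normalized-position parameterization are illustrative assumptions): the decoder predicts a center mu and width sigma over normalized clip positions, so the same two parameters define an attention distribution over any number of clips, which is what makes variable-length videos unproblematic.

```python
import numpy as np

def gaussian_attention(features, mu, sigma):
    """Parametric Gaussian attention over a variable-length clip sequence.

    features: (T, D) array of per-clip features; T may differ per video.
    mu:    predicted attention center, as a normalized position in [0, 1].
    sigma: predicted attention width (illustrative parameterization).
    """
    T = features.shape[0]
    positions = np.arange(T) / max(T - 1, 1)            # normalized clip positions
    weights = np.exp(-0.5 * ((positions - mu) / sigma) ** 2)
    weights /= weights.sum()                            # normalize to a distribution
    return weights @ features                           # weighted context vector

# Example: two videos of different lengths, attending near the middle of each.
short_video = np.random.rand(5, 4)
long_video = np.random.rand(40, 4)
ctx_short = gaussian_attention(short_video, mu=0.5, sigma=0.2)
ctx_long = gaussian_attention(long_video, mu=0.5, sigma=0.2)
```

Because mu and sigma live in a normalized coordinate system rather than indexing absolute time steps, the same predicted parameters yield a context vector of fixed dimension D regardless of T.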
