Abstract

Video compact representation aims to obtain a representation that reflects the kernel modes of video content and concisely describes the video. Because most information in complex videos is either noisy or redundant, some researchers have instead focused on long-term video semantics. Recent video compact representation methods rely heavily on the segmentation accuracy of video semantics. In this paper, we propose a novel framework to address these challenges. Specifically, we design a continuous video semantic embedding model to learn the actual distribution of video words. First, an embedding model based on the continuous bag-of-words method is proposed to learn the video embeddings, integrated with a well-designed discriminative negative sampling approach that emphasizes convincing clips in the embedding while weakening the influence of confusing ones. Second, an aggregated distribution pooling method is proposed to capture the semantic distribution of kernel modes in videos. Finally, the trained model generates compact video representations by direct inference, which gives it better generalization ability than previous methods. We performed extensive experiments on event detection and the mining of representative event parts. Experiments on the TRECVID MED11 and CCV datasets demonstrate the effectiveness of our method: it captures the semantic distribution of kernel modes in videos and shows strong potential to discover and better describe complex video patterns.
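To make the pipeline described above concrete, the following is a minimal, hedged sketch of a CBOW-style embedding over quantized clip-level "video words", trained with a negative-sampling objective. The class name, the assumption of a fixed video-word vocabulary, and the uniform negative sampler in the usage example are illustrative assumptions, not the authors' implementation; in particular, the paper's discriminative negative sampling would replace the uniform sampler shown here.

```python
# Illustrative sketch only: CBOW over quantized video words with negative sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoWordCBOW(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # context ("input") vectors
        self.out_embed = nn.Embedding(vocab_size, dim)  # target ("output") vectors

    def forward(self, context, target, negatives):
        # context:   (B, C) indices of surrounding video words
        # target:    (B,)   index of the center video word
        # negatives: (B, K) sampled negative video words
        ctx = self.in_embed(context).mean(dim=1)        # (B, D) averaged context, as in CBOW
        pos = self.out_embed(target)                    # (B, D)
        neg = self.out_embed(negatives)                 # (B, K, D)

        pos_score = (ctx * pos).sum(-1)                              # (B,)
        neg_score = torch.bmm(neg, ctx.unsqueeze(-1)).squeeze(-1)    # (B, K)

        # Pull the context toward the true word, push it away from the negatives.
        return -F.logsigmoid(pos_score).mean() - F.logsigmoid(-neg_score).mean()

# Usage with hypothetical sizes: 8 samples, context window 4, 5 uniform negatives each.
model = VideoWordCBOW(vocab_size=1000)
loss = model(torch.randint(0, 1000, (8, 4)),
             torch.randint(0, 1000, (8,)),
             torch.randint(0, 1000, (8, 5)))
```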

Highlights

  • Video representation is a classical topic in computer vision

  • We propose a novel discriminative negative sampling approach for training the continuous video embedding model, ensuring that the learned semantic embeddings of video words encode both the context coherence and the discriminative degree of video semantics

  • We propose a continuous video semantic embedding model to learn the actual distribution of video words


Summary

Introduction

Video representation is a classical topic in computer vision. Generally, to obtain video representations, the primary task is to extract useful features from the videos. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks [7,11,12,13], and the modified hierarchical recurrent neural encoder (HRNE) [14] are used to model temporal information and represent videos. These frame-level or segment-level features are learned from well-trained deep models and embed the visual semantic information of video content together with its temporal structure.
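As a simple illustration of the recurrent modeling mentioned above, the sketch below folds pre-extracted frame-level features into a single video-level vector with an LSTM. The feature dimension, hidden size, and class name are assumptions for illustration; this is not the HRNE architecture or the authors' model.

```python
# Illustrative sketch only: LSTM over frame-level features as a video descriptor.
import torch
import torch.nn as nn

class FrameLSTMEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, frames):
        # frames: (B, T, feat_dim) sequence of frame-level CNN features
        _, (h_n, _) = self.lstm(frames)
        return h_n[-1]   # (B, hidden) final hidden state summarizes the sequence

# Usage: a batch of 4 videos with 30 frame features each -> (4, 512) descriptors.
enc = FrameLSTMEncoder()
video_repr = enc(torch.randn(4, 30, 2048))
```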
