Abstract
Video compact representation aims to obtain a representation that reflects the kernel modes of video content and concisely describes the video. As most information in complex videos is either noisy or redundant, some researchers have instead focused on long-term video semantics. However, recent video compact representation methods rely heavily on the segmentation accuracy of video semantics. In this paper, we propose a novel framework to address these challenges. Specifically, we design a novel continuous video semantic embedding model to learn the actual distribution of video words. First, an embedding model based on the continuous bag-of-words method is proposed to learn the video embeddings, integrated with a well-designed discriminative negative sampling approach that emphasizes the convincing clips in the embedding while weakening the influence of the confusing ones. Second, an aggregated distribution pooling method is proposed to capture the semantic distribution of kernel modes in videos. Finally, our well-trained model can generate compact video representations by direct inference, which gives it better generalization ability than previous methods. We performed extensive experiments on event detection and the mining of representative event parts. Experiments on the TRECVID MED11 and CCV datasets demonstrate the effectiveness of our method, which captures the semantic distribution of kernel modes in videos and shows strong potential for discovering and better describing complex video patterns.
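As a rough illustration of the first component, the sketch below pairs a CBOW-style objective over discrete video words with a weighted negative-sampling loss. It is not the authors' implementation: the `VideoCBOW` module, the per-word `neg_weights` term standing in for the discriminative negative sampling, and all dimensions are assumptions for exposition only.

```python
# A minimal sketch, not the paper's implementation. It assumes video clips have
# already been quantized into discrete "video words" (integer IDs) and shows a
# CBOW-style objective: predict a clip's word from the mean embedding of its
# temporal context, with each negative sample re-weighted by a hypothetical
# "discriminativeness" score so that confusing words contribute less.
import torch
import torch.nn.functional as F

class VideoCBOW(torch.nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.in_emb = torch.nn.Embedding(vocab_size, dim)   # context embeddings
        self.out_emb = torch.nn.Embedding(vocab_size, dim)  # target embeddings

    def forward(self, context, target, negatives, neg_weights):
        # context:     (B, C) IDs of surrounding clips
        # target:      (B,)   ID of the centre clip
        # negatives:   (B, K) sampled negative word IDs
        # neg_weights: (B, K) assumed discriminativeness weights in [0, 1]
        ctx = self.in_emb(context).mean(dim=1)               # (B, D) context vector
        pos = (ctx * self.out_emb(target)).sum(dim=-1)       # (B,)   positive score
        neg = torch.bmm(self.out_emb(negatives),             # (B, K) negative scores
                        ctx.unsqueeze(-1)).squeeze(-1)
        # Standard negative-sampling loss, with each negative term scaled by its
        # weight so that "confusing" video words are down-weighted.
        loss = -(F.logsigmoid(pos) +
                 (neg_weights * F.logsigmoid(-neg)).sum(dim=-1)).mean()
        return loss
```

In this reading, the "discriminative" part of the sampling reduces to how the negative weights are chosen; the paper's actual weighting scheme may differ.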
Highlights
Video representation is a classical topic in computer vision
We propose a novel discriminative negative sampling approach for training the continuous video embedding model, ensuring that the learned semantic embeddings of video words encode both the context coherence and the discriminative degree of video semantics
We propose a continuous video semantic embedding model to learn the actual distribution of video words
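The highlights mention aggregated distribution pooling without giving its exact form; the snippet below sketches one plausible reading, in which per-clip embeddings are softly assigned to a set of mode centres and their residuals are aggregated into a single video vector. The function name, the softmax-over-distances assignment, and the temperature parameter are all hypothetical.

```python
# A hedged sketch of how "aggregated distribution pooling" might aggregate
# per-clip embeddings into one compact video vector; the exact formulation in
# the paper may differ.
import numpy as np

def aggregated_distribution_pool(clip_emb: np.ndarray, centres: np.ndarray,
                                 temperature: float = 1.0) -> np.ndarray:
    """clip_emb: (T, D) clip embeddings; centres: (K, D) mode centres.
    Returns a (K * D,) video-level descriptor."""
    # Soft-assign each clip to the mode centres (softmax over negative distances).
    dists = ((clip_emb[:, None, :] - centres[None, :, :]) ** 2).sum(-1)  # (T, K)
    assign = np.exp(-dists / temperature)
    assign /= assign.sum(axis=1, keepdims=True)
    # Aggregate clip-to-centre residuals per mode, then flatten and L2-normalise.
    residuals = clip_emb[:, None, :] - centres[None, :, :]               # (T, K, D)
    pooled = (assign[:, :, None] * residuals).sum(axis=0)                # (K, D)
    vec = pooled.reshape(-1)
    return vec / (np.linalg.norm(vec) + 1e-12)
```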
Summary
Video representation is a classical topic in computer vision. Generally, to obtain video representations, the primary task is to extract useful features from the videos. Recurrent neural networks (RNNs), long short-term memory (LSTM) networks [7,11,12,13], and the modified hierarchical recurrent neural encoder (HRNE) [14] have been used to model temporal information and represent videos. These frame-level or segment-level features are learned from well-trained deep models and can jointly embed the visual semantic information of video content and its temporal information.
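For readers unfamiliar with these encoders, the minimal example below shows the general pattern of turning frame-level CNN features into segment-level features with an LSTM; the feature dimensions and segment length are placeholder choices, not values from the cited works.

```python
# Illustrative only: encode per-frame CNN features with an LSTM and mean-pool
# fixed-length segments, a stand-in for the RNN/LSTM/HRNE encoders cited above.
import torch

frame_feats = torch.randn(1, 120, 2048)        # (batch, frames, CNN feature dim)
lstm = torch.nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)
hidden_states, _ = lstm(frame_feats)           # (1, 120, 512) per-frame states

# Split the sequence into segments and mean-pool each one, giving segment-level
# features that carry both visual content and temporal context.
segment_len = 30
segments = hidden_states.split(segment_len, dim=1)
segment_feats = torch.stack([s.mean(dim=1) for s in segments], dim=1)  # (1, 4, 512)
```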