Abstract

In this paper, we present a novel approach for extracting key segments for event detection in unconstrained videos. Key segments are extracted automatically by transferring knowledge learned from Web images and Web videos to consumer videos. We propose an adaptive latent structural support vector machine (SVM) model in which the locations of key segments are treated as latent variables, since ground-truth key-segment locations are unavailable in the training data. To alleviate the time-consuming and labor-intensive manual annotation of large numbers of training videos, we collect a large number of loosely labeled Web images and videos from online sources; a small number of labeled consumer videos is additionally used to ensure the model's precision. To account for the semantic diversity of key segments, we learn a set of concepts as their semantic description and exploit the temporal ordering of concepts to capture the sequential relations between segments. The concepts are discovered automatically from Web images and videos together with their associated tags and description sentences. Comprehensive experiments on the Columbia Consumer Video (CCV) and TRECVID 2014 Multimedia Event Detection datasets demonstrate that our method outperforms state-of-the-art methods.
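Since the abstract leans on the latent-variable training idea, a brief sketch may help: when ground-truth key-segment locations are unknown, a latent structural SVM is typically trained by alternating between imputing the latent segment locations under the current weights and updating the weights via loss-augmented inference. The Python below is a minimal, self-contained sketch of that general (CCCP-style) procedure, not the paper's adaptive formulation; the feature map, the 0/1 label loss, the fixed-length candidate windows, and the subgradient update are all illustrative assumptions.

```python
# Minimal sketch of latent structural SVM training by alternating
# optimization. All names and modeling choices here are hypothetical
# illustrations, not the paper's implementation.
import numpy as np

def segment_feature(video, h):
    """Hypothetical feature map phi(x, h): mean-pools per-frame features
    over the candidate key-segment window h = (start, end)."""
    start, end = h
    return video[start:end].mean(axis=0)

def impute_latent(video, label, w, candidates):
    """Completion step: pick the segment location that maximizes the
    score of the ground-truth label under the current weights."""
    return max(candidates, key=lambda h: label * (w @ segment_feature(video, h)))

def loss_augmented_inference(video, label, w, candidates):
    """Find the most violated (label, segment) pair for the hinge update,
    using a simple 0/1 loss on the event label."""
    best, best_score = None, -np.inf
    for y in (-1, 1):
        for h in candidates:
            score = y * (w @ segment_feature(video, h)) + float(y != label)
            if score > best_score:
                best, best_score = (y, h), score
    return best

def train_latent_ssvm(videos, labels, dim, candidates,
                      C=1.0, lr=0.01, outer=5, inner=50):
    w = np.zeros(dim)
    for _ in range(outer):
        # Step 1: fix w and impute the latent key-segment locations.
        imputed = [impute_latent(x, y, w, candidates)
                   for x, y in zip(videos, labels)]
        # Step 2: fix the latent variables and update w by subgradient
        # descent on the regularized structural hinge loss.
        for _ in range(inner):
            for x, y, h in zip(videos, labels, imputed):
                y_hat, h_hat = loss_augmented_inference(x, y, w, candidates)
                margin = (float(y != y_hat)
                          + y_hat * (w @ segment_feature(x, h_hat))
                          - y * (w @ segment_feature(x, h)))
                grad = w / C  # gradient of the quadratic regularizer
                if margin > 0:
                    grad += (y_hat * segment_feature(x, h_hat)
                             - y * segment_feature(x, h))
                w -= lr * grad
    return w

# Toy usage: ten synthetic "videos" of 20 frames x 8-dim features each,
# with four fixed-length candidate windows per video.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(20, 8)) for _ in range(10)]
labels = [1 if i % 2 == 0 else -1 for i in range(10)]
candidates = [(s, s + 5) for s in range(0, 16, 5)]
w = train_latent_ssvm(videos, labels, dim=8, candidates=candidates)
```

The alternation matters because the hinge objective is non-convex once segment locations are unknown; fixing the imputed segments makes each inner update a standard convex structural SVM step.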
