Abstract

In recent years, multimedia event detection has attracted extensive research attention because of the exponential growth of web video data. Traditional approaches usually rely on a single visual representation, which may lack sufficient descriptive power. How to jointly employ multiple types of visual representation to facilitate multimedia event detection (MED) in videos remains an open problem. In this work, we propose a novel event detection system based on the combination of multi-view representations and a co-training algorithm. Specifically, given several types of low-level visual features (i.e., Convolutional Neural Network (CNN) features and Fisher vectors), we first train an initial classifier for each type of visual feature. Then, we use these classifiers to separately predict the labels of unlabeled videos, and videos with consistent predictions are merged into the training set. We alternately repeat the processes of training the classifiers and enlarging the training set until convergence. To exploit the complementarity between the different types of visual features, the prediction scores of the two classifiers are fused by a linear weighted fusion method. We evaluate our MED system on the TRECVID MED11 data set, and the experimental results demonstrate that the proposed approach outperforms several state-of-the-art algorithms.
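To make the described pipeline concrete, the sketch below illustrates the co-training loop and the linear weighted score fusion under stated assumptions: logistic-regression classifiers stand in for the paper's per-feature classifiers, and the function names (`co_train`, `fused_scores`), the agreement threshold, and the fusion weight `w` are hypothetical choices for illustration, not values from the paper.

```python
# A minimal co-training sketch for two feature views (e.g., CNN
# descriptors and Fisher vectors), each stored as a NumPy array.
# Assumptions: the classifier type, the confidence threshold, and the
# fusion weight are illustrative, not the paper's exact settings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             threshold=0.9, max_rounds=10):
    """Alternately train one classifier per view and grow the labeled
    set with unlabeled videos on which both views agree confidently."""
    for _ in range(max_rounds):
        clf1 = LogisticRegression(max_iter=1000).fit(X1_lab, y_lab)
        clf2 = LogisticRegression(max_iter=1000).fit(X2_lab, y_lab)
        if len(X1_unlab) == 0:
            break
        p1 = clf1.predict_proba(X1_unlab)
        p2 = clf2.predict_proba(X2_unlab)
        y1, y2 = p1.argmax(axis=1), p2.argmax(axis=1)
        c1, c2 = p1.max(axis=1), p2.max(axis=1)
        # Keep only samples where both views predict the same label
        # with high confidence ("consistent prediction").
        agree = (y1 == y2) & (c1 >= threshold) & (c2 >= threshold)
        if not agree.any():
            break  # converged: no confidently agreed samples remain
        X1_lab = np.vstack([X1_lab, X1_unlab[agree]])
        X2_lab = np.vstack([X2_lab, X2_unlab[agree]])
        y_lab = np.concatenate([y_lab, y1[agree]])
        X1_unlab, X2_unlab = X1_unlab[~agree], X2_unlab[~agree]
    return clf1, clf2

def fused_scores(clf1, clf2, X1, X2, w=0.5):
    """Linear weighted fusion of the two views' prediction scores."""
    return w * clf1.predict_proba(X1) + (1.0 - w) * clf2.predict_proba(X2)
```

Merging only the consistently predicted videos into both views' training sets is what lets each classifier benefit from the other view's confident predictions, which is the core idea of co-training as described in the abstract.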
