Abstract

In this paper, we present recent experiments on using Artificial Neural Networks (ANNs), a new “delayed” approach to speech vs. non-speech segmentation, and the extraction of large-scale pooling features (LSPFs) for detecting “events” in consumer videos, using the audio channel only. An “event” is defined as a sequence of observations in a video that can be directly observed or inferred. Ground truth is given by a semantic description of the event and by a number of example videos. We describe and compare several algorithmic approaches, and report results on the 2013 TRECVID Multimedia Event Detection (MED) task, using arguably the largest such research set currently available. The presented system achieved the best results in most audio-only conditions. While the overall finding is that MFCC features perform best, we find that ANN and LSPF features provide complementary information at various levels of temporal resolution. This paper analyzes both low-level and high-level features, investigating their relative contributions to overall system performance.