Abstract

The problem of event detection in multimedia clips is typically handled by modeling each of the component modalities independently and then combining their detection scores in a late fusion approach. One problem with a late fusion model in the multimedia setting is that the detection scores may be missing from one or more components for a given clip, e.g., when the clip contains no speech or no overlay text. Standard fusion techniques typically address this problem by assuming a default backoff score for a component whenever its detection score is missing. This can bias the fusion model, especially if a given component has many missing detections. In this work, we present the Sparse Conditional Mixture Model (SCMM), which models only the observed detection scores for each example, thereby avoiding the assumptions about score distributions that backoff models make. Our experiments in multimedia event detection on the TRECVID-2011 corpus demonstrate that SCMM achieves statistically significant performance gains over standard late fusion techniques. SCMM is very general and is applicable to fusion problems with missing data in any domain.
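To make the contrast concrete, the sketch below compares backoff-style late fusion, which fills in a default score for a missing modality, with a fusion that uses only the observed scores for each clip. The scores, weights, and the observed-only weighted average are purely illustrative; the abstract does not give the SCMM formulation, which is a conditional mixture model rather than a simple weighted average.

```python
import numpy as np

# Hypothetical per-clip detection scores from three modalities.
# np.nan marks a missing detection (e.g., a clip with no speech).
scores = np.array([
    [0.9, 0.7, np.nan],   # clip 1: overlay-text score missing
    [0.2, np.nan, 0.3],   # clip 2: speech score missing
    [0.8, 0.6, 0.5],      # clip 3: all modalities observed
])
weights = np.array([0.5, 0.3, 0.2])  # illustrative fusion weights


def backoff_fusion(scores, weights, default=0.5):
    """Standard late fusion: replace missing scores with a default backoff value."""
    filled = np.where(np.isnan(scores), default, scores)
    return filled @ weights


def observed_only_fusion(scores, weights):
    """Fuse only the observed scores per clip, renormalizing the weights.

    This mimics the spirit of modeling only observed detections (as SCMM does),
    but is not the SCMM model itself.
    """
    fused = np.empty(scores.shape[0])
    for i, row in enumerate(scores):
        mask = ~np.isnan(row)
        w = weights[mask] / weights[mask].sum()  # renormalize over observed modalities
        fused[i] = row[mask] @ w
    return fused


print("backoff fusion:      ", backoff_fusion(scores, weights))
print("observed-only fusion:", observed_only_fusion(scores, weights))
```

Note how the backoff variant pulls the fused score of clips with missing modalities toward the default value, which is exactly the bias the abstract describes, while the observed-only variant depends solely on the scores that were actually produced.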
