Abstract

We propose a novel weakly supervised framework that jointly tackles entity analysis tasks in vision and language. Given a video with subtitles, we address two questions: a) What do the textual entity mentions refer to? and b) What/who appears in the video key frames? We use a Markov Random Field (MRF) to encode the dependencies within and across the two modalities. The MRF incorporates beliefs obtained from independent methods for the textual and visual entities, and these beliefs are propagated across the modalities to jointly derive the entity labels. We apply the framework to a challenging dataset of wildlife documentaries with subtitles and show that this integrated modeling yields significantly better performance than text-based and vision-based approaches. In particular, textual mentions that cannot be resolved with text-only methods are resolved correctly by our method. The approaches described here bring us closer to automated multimedia indexing.
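The abstract describes the approach only at a high level. As a rough, hypothetical illustration of cross-modal belief propagation on such an MRF, the Python sketch below links textual mentions to visual detections over a shared label space, using simple Potts-style pairwise potentials that reward label agreement. Every name here (joint_labels, text_pot, vis_pot, links, the affinity value) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def normalize(v):
    """Rescale nonnegative scores into a probability distribution."""
    s = v.sum(axis=-1, keepdims=True)
    return np.where(s > 0, v / np.maximum(s, 1e-12), 1.0 / v.shape[-1])

def joint_labels(text_pot, vis_pot, links, affinity=2.0, iters=10):
    """Loopy belief propagation on a bipartite MRF over entity labels.

    text_pot: (M, K) unary beliefs from a text-only model (M mentions)
    vis_pot:  (N, K) unary beliefs from a vision-only model (N detections)
    links:    (i, j) pairs tying mention i to detection j, e.g. because
              they co-occur in a subtitle-aligned key frame (assumed).
    """
    M, K = text_pot.shape
    # Potts-style pairwise potential: boost configurations where the
    # mention label and the linked detection label agree.
    pair = np.ones((K, K)) + (affinity - 1.0) * np.eye(K)

    # One message per edge and direction, initialized uniform.
    m_tv = {e: np.full(K, 1.0 / K) for e in links}  # mention -> detection
    m_vt = {e: np.full(K, 1.0 / K) for e in links}  # detection -> mention

    for _ in range(iters):
        for (i, j) in links:
            # Belief at mention i, excluding the message from detection j.
            b_i = text_pot[i].copy()
            for (i2, j2) in links:
                if i2 == i and j2 != j:
                    b_i *= m_vt[(i2, j2)]
            m_tv[(i, j)] = normalize(b_i @ pair)

            # Belief at detection j, excluding the message from mention i.
            b_j = vis_pot[j].copy()
            for (i2, j2) in links:
                if j2 == j and i2 != i:
                    b_j *= m_tv[(i2, j2)]
            m_vt[(i, j)] = normalize(pair @ b_j)

    # Final beliefs: unaries combined with all incoming messages.
    text_bel, vis_bel = text_pot.copy(), vis_pot.copy()
    for (i, j) in links:
        text_bel[i] *= m_vt[(i, j)]
        vis_bel[j] *= m_tv[(i, j)]
    return normalize(text_bel).argmax(axis=1), normalize(vis_bel).argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_pot = normalize(rng.random((4, 3)))  # 4 mentions, 3 labels
    vis_pot = normalize(rng.random((2, 3)))   # 2 detections
    links = [(0, 0), (1, 0), (2, 1), (3, 1)]  # assumed co-occurrences
    print(joint_labels(text_pot, vis_pot, links))
```

The asynchronous message updates and the symmetric match/mismatch potential keep the sketch compact; the paper's actual potentials, edge construction, and inference schedule may well differ.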
