Abstract

Topic detection based on text reasoning has attracted widespread attention. Existing methods infer topics from textual semantic cues alone. However, each web video is typically described with only a few words, so textual reasoning cues are sparse. Under such sparsity, it is difficult to distinguish videos belonging to the same topic, which makes topic detection for web videos challenging. Fortunately, visual information contains far more detailed cues than text, such as colors, scenes, and objects, so cross-media joint reasoning can supply complementary cues that text alone cannot. In view of this, this paper extends text-based topic detection to cross-media reasoning and proposes a novel heterogeneous interactive tensor learning (HITL) method that detects topics through cross-media joint inference. After local features of keyframes and textual information are extracted, the semantic correlation between visual and textual information is mined by constructing a keyframe-text interaction attention matrix. A joint cue linking textual and visual information is then constructed in a cross-media heterogeneous interaction tensor space, enriching the sparse textual cues through cross-media fusion. Finally, semantic features are extracted through cue interaction in the tensor space and used for topic detection.
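The pipeline described above can be sketched in a minimal form: pairwise keyframe-word similarities form an interaction attention matrix, attended text cues are fused with visual cues via an outer-product tensor, and the flattened tensor serves as the joint feature for topic classification. This is an illustrative sketch under assumed details (feature dimensions, softmax normalization over words, mean pooling, and outer-product fusion), not the paper's exact HITL formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_words, d = 4, 6, 8          # assumed toy sizes
V = rng.standard_normal((n_frames, d))  # keyframe (visual) features
T = rng.standard_normal((n_words, d))   # word (textual) features

# Keyframe-text interaction attention matrix: dot-product similarities,
# normalized over words for each keyframe (assumed normalization).
A = softmax(V @ T.T, axis=1)            # shape (n_frames, n_words)

# Textual cues attended by each keyframe.
T_att = A @ T                           # shape (n_frames, d)

# Joint cross-media cue via outer-product fusion in a tensor space;
# a constant 1 is appended so unimodal cues survive the product
# (a common tensor-fusion convention, assumed here).
v = np.concatenate([V.mean(axis=0), [1.0]])
t = np.concatenate([T_att.mean(axis=0), [1.0]])
Z = np.outer(v, t)                      # (d+1, d+1) interaction tensor

joint = Z.flatten()                     # feature vector for a topic classifier
print(joint.shape)                      # (81,)
```

In practice the flattened tensor would feed a classifier or clustering step for topic detection; the sketch stops at the fused feature.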
