Abstract

Multi-label video classification is a challenging problem in the field of pattern recognition, as it is difficult to localize the occurrences of a large number of labels in videos. To solve this problem, we propose a general framework named MALL-CNN, i.e., Multi-Attention Label Relation Learning Convolutional Neural Network. MALL-CNN not only builds the correspondences between labels and videos with an attention mechanism, but also captures label co-occurrence with a graph learning approach. Specifically, we introduce multiple instance learning to aggregate a set of frame-level features into a video-level feature. Then, the video-level feature is mapped into content-aware category representations in an improved attentional manner. Further, these representations are enhanced by a series of label relation graphs, which transform global label relationships into the label relationships of the current video. With these three processes, frame feature aggregation, video feature mapping, and label relationship construction are achieved in MALL-CNN for multi-label video classification. Extensive experiments on the real-world benchmark YouTube-8M verify that MALL-CNN using only frame features surpasses state-of-the-art methods that use multi-modal features as well as ensemble models.
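
To make the three stages concrete, the following is a minimal PyTorch sketch of a pipeline of the kind the abstract describes: attention-based multiple instance learning pooling of frame features, attentional mapping of the video feature to per-label content-aware representations, and a graph layer that propagates a label co-occurrence adjacency over those representations. All module names, gating choices, and dimensions are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch; names and dimensions are assumptions, not the MALL-CNN release.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionMILPooling(nn.Module):
    """Aggregate frame-level features into one video-level feature."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):                            # frames: (B, T, D)
        alpha = torch.softmax(self.score(frames), dim=1)   # per-frame weights (B, T, 1)
        return (alpha * frames).sum(dim=1)                 # video feature (B, D)

class LabelAttentionMapping(nn.Module):
    """Map the video feature to one content-aware representation per label."""
    def __init__(self, dim, num_labels):
        super().__init__()
        self.label_queries = nn.Parameter(torch.randn(num_labels, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feat):                         # video_feat: (B, D)
        v = self.proj(video_feat)
        gate = torch.sigmoid(self.label_queries)           # label-specific gates (C, D)
        return v.unsqueeze(1) * gate.unsqueeze(0)           # (B, C, D)

class LabelGraphLayer(nn.Module):
    """Propagate label co-occurrence structure over category representations."""
    def __init__(self, dim, adjacency):
        super().__init__()
        self.register_buffer("adj", adjacency)              # (C, C) row-normalized co-occurrence
        self.transform = nn.Linear(dim, dim)

    def forward(self, cat_reps):                            # cat_reps: (B, C, D)
        mixed = torch.einsum("ij,bjd->bid", self.adj, cat_reps)
        return F.relu(self.transform(mixed)) + cat_reps     # residual update

class MultiLabelVideoNet(nn.Module):
    def __init__(self, dim, num_labels, adjacency):
        super().__init__()
        self.pool = AttentionMILPooling(dim)
        self.mapping = LabelAttentionMapping(dim, num_labels)
        self.graph = LabelGraphLayer(dim, adjacency)
        self.classifier = nn.Linear(dim, 1)

    def forward(self, frames):                              # frames: (B, T, D)
        video_feat = self.pool(frames)
        cat_reps = self.graph(self.mapping(video_feat))
        return self.classifier(cat_reps).squeeze(-1)         # per-label logits (B, C)

if __name__ == "__main__":
    C, D = 10, 64
    adj = torch.softmax(torch.rand(C, C), dim=1)             # stand-in co-occurrence graph
    model = MultiLabelVideoNet(D, C, adj)
    logits = model(torch.randn(2, 30, D))                    # 2 videos, 30 frames each
    print(logits.shape)                                      # torch.Size([2, 10])

In practice the adjacency would be estimated from label co-occurrence statistics on the training set rather than sampled randomly as in this toy example.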
