Abstract

Zero-shot learning aims to transfer knowledge from existing information so that new classes can be classified without any visual training data. In current work on zero-shot video classification, only the category name is available for unseen classes. However, most category names cannot fully describe the content of a video; they are merely concise labels assigned by humans to actions, and thus carry very little information. To compensate for the semantic deficiencies of video databases and to build relationships between categories, we propose a multi-modal generalized zero-shot video classification framework based on multi-grained semantic information, together with a video description text database. Our model mines semantic knowledge from category names, which are accurate but uninformative, and from description texts, which are exhaustive but redundant, and it learns visual knowledge from semantic embeddings of varying granularity. We then use the learned semantic and visual knowledge to perform multi-grained classification on test videos containing both seen and unseen classes. To describe actions in detail and provide complete semantic information, we construct a description text database. The textual descriptions it contains, including category definitions and explanations, help establish relationships between categories and thus provide a more reliable basis for visual feature synthesis. Furthermore, our framework synthesizes features for unseen classes from both coarse-grained and fine-grained semantic information, which effectively alleviates the bias of generalized zero-shot learning toward seen classes. Extensive experimental results on the database demonstrate the validity of our method and the effectiveness of the description texts for generalized zero-shot video classification.
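
To make the feature-synthesis idea concrete, the following is a minimal sketch (not the paper's actual architecture, which the abstract does not specify) of a generator that is conditioned on both a coarse-grained category-name embedding and a fine-grained description-text embedding. All dimensions, layer sizes, and names are hypothetical, and PyTorch is assumed only for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions; the abstract does not state the actual values.
VIDEO_FEAT_DIM = 2048   # dimensionality of the extracted video features
NAME_EMB_DIM = 300      # coarse-grained embedding of the category name
DESC_EMB_DIM = 768      # fine-grained embedding of the description text
NOISE_DIM = 64          # random noise for feature diversity


class MultiGrainedGenerator(nn.Module):
    """Synthesizes video features conditioned on both coarse-grained
    (category-name) and fine-grained (description-text) semantic embeddings."""

    def __init__(self):
        super().__init__()
        cond_dim = NAME_EMB_DIM + DESC_EMB_DIM + NOISE_DIM
        self.net = nn.Sequential(
            nn.Linear(cond_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, VIDEO_FEAT_DIM),
            nn.ReLU(),
        )

    def forward(self, name_emb, desc_emb):
        # Concatenate both granularities of semantic information with noise.
        noise = torch.randn(name_emb.size(0), NOISE_DIM, device=name_emb.device)
        cond = torch.cat([name_emb, desc_emb, noise], dim=1)
        return self.net(cond)


# Usage sketch: synthesize features for an unseen class, which could then be
# combined with real seen-class features to train a generalized classifier.
gen = MultiGrainedGenerator()
name_emb = torch.randn(32, NAME_EMB_DIM)   # e.g., word embedding of the class name
desc_emb = torch.randn(32, DESC_EMB_DIM)   # e.g., sentence encoding of its description
fake_unseen_feats = gen(name_emb, desc_emb)
```

Conditioning the generator on both granularities is what lets synthesized unseen-class features reflect the richer description text rather than only the terse category name, which is the mechanism the abstract credits with reducing seen-class bias.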
