Generalized zero-shot video classification aims to train a classifier that recognizes videos from both seen and unseen classes. Since no visual information for unseen classes is available during training, most existing methods rely on generative adversarial networks to synthesize visual features for unseen classes from the class embeddings of category names. However, category names describe only the content of a video and ignore other relational information. Videos are rich information carriers that include actions, performers, environments, and so on, and the semantic descriptions of videos also express events at different levels of action. To fully exploit this video information, we propose a fine-grained feature generation model for generalized zero-shot video classification based on video category names and their corresponding description texts. To obtain comprehensive information, we first extract content information from coarse-grained semantic information (category names) and motion information from fine-grained semantic information (description texts) as the basis for feature synthesis. We then subdivide motion into hierarchical constraints on the fine-grained correlation between event and action at the feature level. In addition, we propose a loss that avoids the imbalance between positive and negative examples to constrain the consistency of features at each level. To demonstrate the validity of the proposed framework, we perform extensive quantitative and qualitative evaluations on two challenging datasets, UCF101 and HMDB51, and obtain positive gains on the task of generalized zero-shot video classification.
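As a rough illustration of the feature-synthesis idea summarized above (a minimal sketch, not the authors' implementation), the following PyTorch snippet conditions a GAN-style generator on both a coarse class-name embedding and a fine-grained description-text embedding. All dimensions, module names, and the hinge objective are illustrative assumptions; the paper's hierarchical event/action constraints and consistency loss are omitted.

```python
# Sketch (assumed design): a conditional feature generator that maps
# [noise | class-name embedding | description embedding] to a visual feature,
# plus a conditional discriminator, trained with a basic hinge GAN loss.
import torch
import torch.nn as nn

NOISE_DIM, NAME_DIM, DESC_DIM, FEAT_DIM = 64, 300, 768, 2048  # assumed sizes

class ConditionalGenerator(nn.Module):
    """Synthesizes a video feature from noise plus two semantic embeddings."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + NAME_DIM + DESC_DIM, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, FEAT_DIM),
            nn.ReLU(),  # extracted CNN features are typically non-negative
        )

    def forward(self, noise, name_emb, desc_emb):
        return self.net(torch.cat([noise, name_emb, desc_emb], dim=1))

class ConditionalDiscriminator(nn.Module):
    """Scores whether a visual feature matches the given semantic condition."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + NAME_DIM + DESC_DIM, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, 1),
        )

    def forward(self, feat, name_emb, desc_emb):
        return self.net(torch.cat([feat, name_emb, desc_emb], dim=1))

# Toy usage: one discriminator/generator update on random stand-in data.
G, D = ConditionalGenerator(), ConditionalDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real_feat = torch.randn(8, FEAT_DIM).abs()  # stand-in seen-class features
name_emb = torch.randn(8, NAME_DIM)         # stand-in class-name embeddings
desc_emb = torch.randn(8, DESC_DIM)         # stand-in description embeddings
noise = torch.randn(8, NOISE_DIM)

# Discriminator step (hinge loss, an assumed objective).
fake_feat = G(noise, name_emb, desc_emb).detach()
d_loss = (torch.relu(1 - D(real_feat, name_emb, desc_emb)).mean()
          + torch.relu(1 + D(fake_feat, name_emb, desc_emb)).mean())
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step.
g_loss = -D(G(noise, name_emb, desc_emb), name_emb, desc_emb).mean()
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Once trained on seen classes, such a generator can synthesize features for unseen classes from their semantic embeddings, and a standard softmax classifier can then be trained on the union of real seen-class and synthesized unseen-class features.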