Zero-shot learning (ZSL) aims to learn a model on seen classes that can recognize samples from unseen classes, while generalized ZSL (GZSL) moves a step closer to realistic scenarios by recognizing both seen and unseen samples. Existing methods rely on semantic descriptions as side-information and enforce a tight alignment between the visual and semantic spaces. However, this tight modality alignment may yield incomplete representations, discarding detailed and discriminative information present in the original visual features. In this paper, we propose a simple yet effective superclass-aware visual feature disentangling method, termed SupVFD, for GZSL. We use the neighbor relations of the semantic descriptions to define superclasses, and, guided by these superclasses, our method disentangles visual features into discriminative and transferable factors. In this way, the semantic descriptions act as implicit supervision, preserving the valuable detailed and discriminative information in the visual features. Extensive experiments in both ZSL and GZSL settings show that our method outperforms state-of-the-art methods on image object classification as well as video action recognition. Code is available at https://github.com/changniu54/SupVFD-Master.
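To make the superclass idea concrete, here is a minimal sketch of one plausible construction, assuming superclasses are obtained by clustering class-level semantic vectors (e.g., attribute embeddings) with k-means; the paper's actual neighbor-relation grouping may differ, and the names `build_superclasses`, `semantic_vecs`, and `n_superclasses` are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_superclasses(semantic_vecs, n_superclasses):
    """Group classes into superclasses by clustering their
    class-level semantic description vectors.

    semantic_vecs: (num_classes, attr_dim) array, one semantic
                   embedding per class.
    Returns an array mapping each class index to a superclass id.
    """
    # L2-normalize so clustering reflects semantic similarity
    # (direction) rather than vector magnitude.
    norms = np.linalg.norm(semantic_vecs, axis=1, keepdims=True)
    normed = semantic_vecs / np.clip(norms, 1e-12, None)
    km = KMeans(n_clusters=n_superclasses, n_init=10, random_state=0)
    return km.fit_predict(normed)

# Toy usage: 6 classes with 4-dim attribute vectors, 2 superclasses.
rng = np.random.default_rng(0)
attrs = np.vstack([rng.normal(0, 1, (3, 4)) + 3.0,   # one semantic neighborhood
                   rng.normal(0, 1, (3, 4)) - 3.0])  # another neighborhood
print(build_superclasses(attrs, n_superclasses=2))
```

The resulting class-to-superclass mapping could then supervise the disentanglement step, encouraging the transferable factor to be shared within a superclass while the discriminative factor separates its member classes.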