Contrastive Language-Image Pretraining (CLIP) models have achieved significant success and have markedly improved the performance of various downstream tasks, including action recognition. However, how to effectively introduce external knowledge into action recognition models remains an open question. In this work, External and Priori Knowledge CLIP (EPK-CLIP) is proposed to introduce external knowledge into the model. To capture this knowledge, an external knowledge embedding module is proposed, which generates and exploits human-object interaction relations as external knowledge, enabling the model to learn better features. Furthermore, a sparse regularization term is introduced into the loss function, allowing the model to exploit the sparse prior knowledge inherent in the classification task. Finally, a multiple-inference module is proposed to obtain classification results from both direct and indirect perspectives; the final classification result is obtained by fusing the outputs of the different reasoning branches. Moreover, four external knowledge datasets, Kinetics-400-VC, Jester-VC, HMDB-51-VC, and UCF-101-VC, are built and released for public use, each a multimodal extension of the corresponding action recognition dataset. Under fully-supervised settings, our model achieves top-1 accuracies of 84.3%, 97.1%, 82.9%, and 98.2% on Kinetics-400, Jester, HMDB-51, and UCF-101, respectively. In zero-shot experiments, our model also achieves state-of-the-art results, with top-1 accuracies of 51.6% and 77.7% on HMDB-51 and UCF-101, respectively. All related datasets and code can be found at https://github.com/geek12138/EPK-CLIP.
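The abstract does not specify the exact form of the sparse regularization, so the following is only a minimal sketch of the general idea: a standard cross-entropy loss augmented with a sparsity-inducing penalty (here, an entropy term that pushes each predicted class distribution towards a near-one-hot, i.e. sparse, vector). The function and variable names are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

def sparse_regularized_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            lambda_sparse: float = 0.1) -> torch.Tensor:
    """Cross-entropy plus a sparsity-inducing penalty on the predictions.

    Illustrative stand-in only: EPK-CLIP's actual regularizer may differ
    in form and weighting.

    logits:  (batch, num_classes) raw classifier scores
    targets: (batch,) ground-truth class indices
    """
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Entropy of each predicted distribution; minimizing it concentrates
    # probability mass on few classes, matching the sparse class prior
    # (each clip belongs to one, or very few, action categories).
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    return ce + lambda_sparse * entropy
```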