Video Class-Incremental Learning With CLIP-Based Transformer
Vision-language pre-training models have shown significant potential in various domains, but few attempts have been made to introduce them into continual learning for video action recognition. We propose the Video Class-Incremental Learner with CLIP-based Transformer (VCIL-CT), which trains the action recognition task with a CLIP-based vision transformer in a class-incremental learning pipeline. To address catastrophic forgetting in the transformer, we introduce Attention Distillation, which distills the attention features from each transformer decoder. Because class-incremental learning can induce a strong bias toward newly added classes, we incorporate a Class Balance Module to counteract bias toward the new task. Furthermore, we adopt an Exemplar Augment strategy to improve exemplar quality during the data replay step. We evaluate our method on the incremental action recognition benchmark introduced by TCD, using the UCF101, HMDB51, and UESTC-MMEA-CL datasets, and demonstrate its effectiveness compared with existing state-of-the-art continual learning methods for action recognition.
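As a rough illustration of the attention distillation idea described above (not the paper's actual implementation), one common formulation penalizes the discrepancy between attention maps produced by the frozen old model (teacher) and the current model (student) at corresponding transformer blocks. The function and argument names below are assumptions; a minimal NumPy sketch:

```python
import numpy as np

def attention_distillation_loss(student_attns, teacher_attns):
    """Hypothetical sketch of an attention distillation loss:
    mean-squared error between the attention maps of corresponding
    decoder blocks in the student (current) and teacher (old) models.

    student_attns, teacher_attns: lists of arrays, one per block,
    each shaped e.g. (heads, tokens, tokens). Names are illustrative.
    """
    per_block = [np.mean((s - t) ** 2)
                 for s, t in zip(student_attns, teacher_attns)]
    # Average over blocks so the loss scale is independent of depth.
    return float(np.mean(per_block))
```

In practice such a term would be added, with a weighting coefficient, to the classification loss during each incremental training phase.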