Deep learning based action recognition methods require large amount of labelled training data. However, labelling large-scale video data is time consuming and tedious. In this paper, we consider a more challenging few-shot action recognition problem where the training samples are few and rare. To solve this problem, memory network has been designed to use an external memory to remember the experience learned in training and then apply it to few-shot prediction during testing. However, existing memory-based methods just update the visual information with fixed label embeddings in the memory, which cannot adapt well to novel activities during testing. To alleviate the issue, we propose a novel end-to-end cross-modal memory network for few-shot activity recognition. Specifically, the proposed memory architecture stores the dynamic visual and textual semantics for some high-level attributes related to human activities. And the learned memory can provide effective multi-modal information for new activity recognition in the testing stage. Extensive experimental results on two video datasets, including HMDB51 and UCF101, indicate that our method could achieve significant improvements over other previous methods.
Read full abstract