Semi-supervised learning for video action recognition is a challenging research area. Existing state-of-the-art methods augment data along the temporal dimension of actions and combine this augmentation with FixMatch, the mainstream consistency-based semi-supervised learning framework, for action recognition. However, these approaches have the following limitations: (1) data augmentation based on video clips lacks coarse-grained and fine-grained representations of actions in temporal sequences, so models have difficulty understanding synonymous representations of actions in different motion phases; (2) pseudo-label selection based on constant thresholds lacks a “make-up curriculum” for difficult actions, which results in low utilization of the unlabeled data corresponding to those actions. To address these shortcomings, we propose TACL, an algorithm for semi-supervised action recognition via temporal augmentation using curriculum learning. Compared to previous works, TACL explores different representations of the same action semantics in the temporal sequences of a video and uses the idea of curriculum learning (CL) to reduce the difficulty of the model training process. First, for different action expressions with the same semantics, we design temporal action augmentation (TAA) for videos, which obtains coarse-grained and fine-grained action expressions based on constant-velocity and hetero-velocity methods, respectively. Second, we construct a temporal signal that constrains the model so that fine-grained action expressions containing different movement phases yield the same prediction, and we achieve action consistency learning (ACL) by combining this signal with the label and pseudo-label signals. Finally, we propose action curriculum pseudo labeling (ACPL), a parallel loose-and-strict dynamic threshold evaluation algorithm for selecting and labeling unlabeled data. We evaluate TACL on three standard public datasets: UCF101, HMDB51, and Kinetics. The combined experiments show that TACL significantly improves the accuracy of models trained on a small amount of labeled data and better evaluates the learning effects for different actions.
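To illustrate the general idea of replacing a constant pseudo-label threshold with per-class dynamic thresholds, the following is a minimal sketch in Python. It is not the ACPL algorithm itself, whose details are not given in this abstract. It assumes a FlexMatch-style rule in which each class threshold is scaled by an estimated learning status, and the function names `class_thresholds` and `select_pseudo_labels`, the base threshold of 0.95, and the fallback for the case where nothing has been selected yet are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def class_thresholds(selected_per_class: Sequence[int],
                     base_threshold: float = 0.95) -> List[float]:
    """Scale a base confidence threshold per class (illustrative, not ACPL).

    selected_per_class[c] counts how many unlabeled samples have so far been
    confidently assigned to class c; it serves as a proxy for how well the
    class has been learned.
    """
    peak = max(selected_per_class)
    if peak == 0:
        # Assumption: before any sample is selected, use the constant threshold.
        return [base_threshold] * len(selected_per_class)
    thresholds = []
    for count in selected_per_class:
        status = count / peak  # in [0, 1]; equals 1 for the best-learned class
        # Convex mapping x / (2 - x): poorly learned classes get lower thresholds.
        thresholds.append(base_threshold * status / (2.0 - status))
    return thresholds


def select_pseudo_labels(probs: Sequence[Sequence[float]],
                         thresholds: Sequence[float]) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs that pass their class threshold."""
    selected = []
    for idx, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if p[label] >= thresholds[label]:
            selected.append((idx, label))
    return selected
```

Under this rule a class with few confident predictions receives a lower threshold, so more of its unlabeled samples contribute to training. That is one possible reading of a "make-up curriculum" for difficult actions.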