Abstract

Self-supervised contrastive learning has significantly improved performance on action recognition tasks by discovering useful signals in unlabeled videos. Nevertheless, characteristics of existing video benchmark datasets cause the learned video representations to be contextually biased toward dominant backgrounds and scene correlations, ultimately leading to poor generalization on scene-invariant action recognition. Therefore, we propose Actor-aware Self-supervised Learning for Semi-supervised Video Representation Learning (ActorSL). We align localized actors with their corresponding scene information to encourage the model to learn discriminative regions and to mitigate its dependence on the video background during contrastive training. Furthermore, we present an inter-video Background Mixing (iBM) augmentation strategy to introduce scene consistency into the model. For iBM, we patch together inter-video crops of four randomly selected frames to create a unique frame for each video. The patched frame is blended with the target video frames to generate a spatially augmented sample. Then, the actor-scene-aligned features and the features of iBM-augmented videos are used to jointly optimize a contrastive loss and consistency regularization in a semi-supervised manner. Moreover, iBM combines the one-hot-encoded labels of the patches with the label of the target video as a label smoothing regularizer to soften the decision boundaries of the semi-supervised model. Our experimental results reveal that ActorSL notably improves upon current state-of-the-art semi-supervised methods on the Kinetics-400, UCF101, and HMDB51 datasets under low-label regimes. Code is released at https://github.com/Endarzboy/ActorSL.
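The iBM step described above can be made concrete with a minimal sketch. This is not the authors' implementation: the 2×2 quadrant tiling of the four donor crops, the `blend_lambda` coefficient, and all function and argument names are assumptions made for illustration; the official code at the repository linked above is authoritative.

```python
import torch

def inter_video_background_mix(target_clip, donor_clips, donor_labels,
                               target_label, num_classes, blend_lambda=0.5):
    """Hypothetical sketch of iBM as summarized in the abstract.

    target_clip:  (T, C, H, W) frames of the video being augmented.
    donor_clips:  list of 4 clips (each (T, C, H, W)) from other videos.
    donor_labels: list of 4 integer class labels for the donor clips.
    target_label: integer class label of the target video.
    """
    T, C, H, W = target_clip.shape

    # 1) Build a single "patched" frame from a crop of one randomly chosen
    #    frame per donor video. Here the four crops are tiled as a 2x2 grid
    #    of quadrants (an assumed layout).
    patched = torch.zeros(C, H, W)
    corners = [(0, 0), (0, W // 2), (H // 2, 0), (H // 2, W // 2)]
    for clip, (y, x) in zip(donor_clips, corners):
        frame = clip[torch.randint(len(clip), (1,)).item()]
        patched[:, y:y + H // 2, x:x + W // 2] = frame[:, y:y + H // 2, x:x + W // 2]

    # 2) Blend the patched frame into every target frame to obtain the
    #    spatially augmented sample (broadcast over the time dimension).
    mixed_clip = blend_lambda * target_clip + (1.0 - blend_lambda) * patched

    # 3) Smooth the target label by mixing in the donor (patch) labels,
    #    yielding a soft label that sums to 1.
    soft_label = torch.zeros(num_classes)
    soft_label[target_label] = blend_lambda
    for lbl in donor_labels:
        soft_label[lbl] += (1.0 - blend_lambda) / len(donor_labels)

    return mixed_clip, soft_label
```

In this reading, the soft label acts as the label smoothing regularizer mentioned in the abstract: the semi-supervised classifier is trained against `soft_label` instead of a hard one-hot target, which softens its decision boundaries.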
