Self-supervised video-based action recognition is a challenging task: the principal information characterizing an action must be extracted from content-diverse videos in large unlabeled datasets. Most existing methods exploit the natural spatio-temporal properties of video to obtain effective action representations from a visual perspective, while neglecting semantics, which are closer to human cognition. To address this, we propose VARD, a self-supervised Video-based Action Recognition method with Disturbances that extracts the principal information of an action in both visual and semantic terms. According to cognitive neuroscience research, human recognition ability is activated by visual and semantic attributes. Intuitively, minor changes to the actor or scene in a video do not affect a person's recognition of the action, and different people reach consistent judgments when they recognize the same action video. In other words, for an action video, the information that remains constant despite disturbances to the visual content or to the semantic encoding process is sufficient to represent the action. To learn such information, we construct a positive clip/embedding for each action video. Compared to the original video clip/embedding, the positive clip/embedding is disturbed visually/semantically by Video Disturbance and Embedding Disturbance, respectively. Our objective is to pull the positive closer to the original clip/embedding in the latent space. In this way, the network is driven to focus on the principal information of the action, while the impact of sophisticated details and inconsequential variations is weakened. Notably, VARD does not require optical flow, negative samples, or pretext tasks.
Extensive experiments on the UCF101 and HMDB51 datasets demonstrate that VARD improves on the strong baseline and outperforms multiple classical and advanced self-supervised action recognition methods.
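The abstract does not state the exact loss or how the disturbances are implemented, so the following is only a minimal sketch of the negative-free "pull the positive closer" idea on the embedding side. It assumes that Embedding Disturbance can be approximated by small Gaussian noise and that the objective is a cosine-based alignment loss of the form used by negative-free methods such as BYOL. The names `disturb_embedding`, `alignment_loss`, and `sigma` are hypothetical and do not come from the paper.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (the epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]


def disturb_embedding(z, sigma=0.1, rng=random):
    """Assumed stand-in for Embedding Disturbance: add Gaussian noise per dimension."""
    return [x + rng.gauss(0.0, sigma) for x in z]


def alignment_loss(z_orig, z_pos):
    """Negative-free loss: 2 - 2 * cosine similarity of the two embeddings.

    The value is 0 when the embeddings point in the same direction and grows
    as they diverge; no negative samples are involved.
    """
    a, b = l2_normalize(z_orig), l2_normalize(z_pos)
    cos = sum(x * y for x, y in zip(a, b))
    return 2.0 - 2.0 * cos


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(128)]     # stand-in for an encoder output
z_pos = disturb_embedding(z, sigma=0.1, rng=rng)  # semantically disturbed positive
loss = alignment_loss(z, z_pos)                   # small, since the disturbance is minor
```

In a real training loop the embeddings would come from a video encoder and the loss would be minimized by gradient descent; the visual branch would apply the same loss to a clip disturbed at the pixel level rather than to a noised embedding.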