Abstract
Unsupervised representation learning for videos has recently achieved remarkable performance owing to the effectiveness of contrastive learning. Most works on video contrastive learning (VCL) pull all snippets from the same video into the same category, even when some of them come from different actions, leading to temporal collapse, i.e., the snippet representations of a video remain unchanged as time evolves. In this paper, we introduce a novel intra-video contrastive learning (intra-VCL) that alleviates this issue by further distinguishing actions within a video: an asynchronous long-term memory bank caches the representations of all snippets of each video, and an extra positive/negative snippet is mined within each video based on this memory bank. In addition, since intra-VCL relies on the asynchronous long-term memory bank, and its asynchronous updates introduce inconsistencies into contrastive learning, we further propose a consistent contrastive module (CCM) to perform consistent intra-VCL. Specifically, in the CCM, we propose an intra-video self-attention refinement function that reduces the inconsistencies among the asynchronously updated snippet representations in the long-term memory, and an adaptive loss re-weighting that suppresses unreliable self-supervision produced by inconsistent contrastive pairs. We call our method consistent intra-VCL. Extensive experiments demonstrate the effectiveness of the proposed consistent intra-VCL, which achieves state-of-the-art performance on standard benchmarks for self-supervised action recognition, with top-1 accuracies of 64.2% and 91.0% on HMDB-51 and UCF-101, respectively.
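To make the abstract's components concrete, the sketch below illustrates the general idea of mining an extra intra-video positive/negative from an asynchronously updated memory bank and re-weighting the contrastive loss by a consistency score. It is a minimal illustration under our own assumptions, not the authors' implementation; all class names, tensor shapes, and hyper-parameters (e.g., the momentum and temperature values) are hypothetical.

```python
# Minimal sketch of intra-video contrastive mining with an asynchronous memory bank.
# Not the paper's code; shapes, names, and the weighting rule are illustrative assumptions.
import torch
import torch.nn.functional as F

class IntraVCLSketch:
    def __init__(self, num_videos, snippets_per_video, dim, momentum=0.5, temperature=0.07):
        # Long-term memory: one slot per snippet of every video, updated asynchronously.
        self.memory = F.normalize(torch.randn(num_videos, snippets_per_video, dim), dim=-1)
        self.momentum = momentum
        self.t = temperature

    def update(self, video_idx, snippet_idx, feat):
        # Momentum (asynchronous) update of one cached snippet representation.
        feat = F.normalize(feat, dim=-1)
        slot = self.memory[video_idx, snippet_idx]
        self.memory[video_idx, snippet_idx] = F.normalize(
            self.momentum * slot + (1.0 - self.momentum) * feat, dim=-1)

    def loss(self, video_idx, snippet_idx, anchor, negatives):
        # anchor: [dim] current snippet embedding; negatives: [K, dim] inter-video negatives.
        anchor = F.normalize(anchor, dim=-1)
        intra = self.memory[video_idx]                  # [S, dim] cached snippets of the same video
        sims = intra @ anchor                           # similarity to each cached snippet
        mask = torch.ones_like(sims, dtype=torch.bool)
        mask[snippet_idx] = False                       # exclude the anchor's own slot

        pos_idx = sims.masked_fill(~mask, float('-inf')).argmax()  # extra intra-video positive
        neg_idx = sims.masked_fill(~mask, float('inf')).argmin()   # extra intra-video negative

        pos_sim = intra[pos_idx] @ anchor
        neg_sims = torch.cat([negatives @ anchor, (intra[neg_idx] @ anchor).view(1)])

        # InfoNCE-style loss with the mined positive at index 0.
        logits = torch.cat([pos_sim.view(1), neg_sims]) / self.t
        nce = F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

        # Adaptive re-weighting (stand-in for the CCM): down-weight the loss when the
        # mined positive is inconsistent with the current anchor.
        weight = pos_sim.clamp(min=0).detach()
        return weight * nce
```

In this toy version, consistency is approximated by the anchor's similarity to the mined positive; the paper's CCM instead applies an intra-video self-attention refinement to the cached representations before mining and weighting.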