Abstract

Self-supervised methods have significantly closed the gap with end-to-end supervised learning for image classification [13], [24]. In the case of human action videos, however, where both appearance and motion are significant factors of variation, this gap remains large [28], [58]. One of the key reasons for this is that sampling pairs of similar video clips, a required step for many self-supervised contrastive learning methods, is currently done conservatively to avoid false positives. A typical assumption is that similar clips only occur temporally close within a single video, leading to insufficient examples of motion similarity. To mitigate this, we propose SLIC, a clustering-based self-supervised contrastive learning method for human action videos. Our key contribution is that we improve upon the traditional intra-video positive sampling by using iterative clustering to group similar video instances. This enables our method to leverage pseudo-labels from the cluster assignments to sample harder positives and negatives. SLIC outperforms state-of-the-art video retrieval baselines by <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$+15.4\%$</tex> on top-1 recall on UCF101 and by <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$+5.7\%$</tex> when directly transferred to HMDB51. With end-to-end finetuning for action classification, SLIC achieves 83.2% top-1 accuracy <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(+0.8\%)$</tex> on UCF101 and 54.5% on HMDB51 <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$(+1.6\%)$</tex>. SLIC is also competitive with the state-of-the-art in action classification after self-supervised pretraining on Kinetics400.
