Abstract

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: first, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, which learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; second, we propose a curriculum training scheme that predicts further into the future with progressively less temporal context, which encourages the model to encode only slowly varying spatio-temporal signals and therefore leads to semantic representations; third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then fine-tuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous self-supervised methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.
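
The following is a minimal PyTorch sketch of the predictive setup the abstract describes, included purely for illustration: an encoder embeds each spatio-temporal block, a recurrent aggregator summarises the past blocks, and a prediction head produces the embedding of the future block. The class name, layer sizes, the use of a plain GRU over spatially pooled features, and the single-step prediction are all simplifying assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DPCSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # f: encodes each spatio-temporal block of frames into an embedding.
        # The paper uses a 3D-CNN backbone; a tiny conv stack stands in here.
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(64, feat_dim))
        # g: recurrently aggregates past block embeddings into a context vector.
        self.agg = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # phi: predicts the embedding of the next (future) block from the context.
        self.pred = nn.Linear(feat_dim, feat_dim)

    def forward(self, blocks):
        # blocks: (B, N, C, T, H, W) -- N consecutive spatio-temporal blocks.
        B, N = blocks.shape[:2]
        z = self.encoder(blocks.flatten(0, 1)).view(B, N, -1)  # per-block embeddings
        context, _ = self.agg(z[:, :-1])                       # aggregate the past blocks
        z_hat = self.pred(context[:, -1])                      # predicted future embedding
        z_true = z[:, -1]                                      # actual future embedding
        return z_hat, z_true

# Example shapes: 2 clips, each split into 8 blocks of 5 RGB frames at 64x64.
model = DPCSketch()
z_hat, z_true = model(torch.randn(2, 8, 3, 5, 64, 64))
```

In the actual DPC model the prediction and matching are made densely, i.e. at every spatial position of the final feature map, and several future blocks are predicted recurrently; this sketch pools spatially and predicts a single step to keep the structure easy to follow.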

Highlights

  • Videos are very appealing as a data source for self-supervision: there is almost an infinite supply available (from YouTube etc.); image-level proxy losses can be used at the frame level; and there are plenty of additional proxy losses that can be employed from the temporal information

  • The contributions of this paper are three-fold: first, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos; second, we propose a curriculum training scheme that forces the model to only encode the slowly varying spatio-temporal representation, i.e. a semantic embedding, and gradually learn to predict further into the future with progressively less temporal context (a toy schedule is sketched after this list); third, we evaluate the approach by first training the DPC model on the Kinetics-400 [15] dataset using self-supervised learning, and then fine-tuning on action recognition benchmarks
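
As a concrete but purely illustrative reading of the curriculum idea above, the schedule below gradually moves blocks from the observed context to the prediction targets as training progresses. The function name, the number of stages, the stage length of 30 epochs, and the total of 8 blocks are assumptions made for the sketch, not the paper's published settings.

```python
def curriculum_stage(epoch, total_blocks=8):
    """Return (num_past, num_pred): how many blocks to observe vs. predict."""
    stage = min(epoch // 30, 2)           # advance one stage every 30 epochs (assumed)
    num_pred = 1 + stage                  # predict 1, then 2, then 3 future blocks
    num_past = total_blocks - num_pred    # remaining blocks provide the temporal context
    return num_past, num_pred
```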

Summary

Introduction

Videos are very appealing as a data source for self-supervision: there is almost an infinite supply available (from YouTube etc.); image-level proxy losses can be used at the frame level; and there are plenty of additional proxy losses that can be employed from the temporal information. One of the most natural, and one of the first, video proxy losses is to predict future frames of a video based on frames from the past. This has ample scope for exploration, by varying the extent of the past knowledge (the temporal aggregation window used for the prediction) and the temporal distance into the future for the predicted frames. Approaches that only predict the frame embedding, such as Vondrick et al. [40], avoid the potentially unnecessary task of detailed reconstruction, and use a mixture model to resolve the uncertainty in future prediction. Not applied to videos (but rather to speech signals and images), the Contrastive Predictive Coding (CPC) model of Oord et al. [39] also learns embeddings, in their case by using a multi-way classification over temporal audio frames (or image patches) rather than the regression loss of [40].
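
To make the distinction between the two loss styles concrete, the following is a hedged PyTorch sketch contrasting a regression loss on a predicted future embedding with a CPC-style multi-way classification (InfoNCE-like) loss. The function names and the use of the other samples in a batch as negatives are illustrative assumptions, not the exact formulations of [39] or [40].

```python
import torch
import torch.nn.functional as F

def regression_loss(z_hat, z_true):
    # Regress directly onto the future embedding, as in embedding-prediction approaches.
    return F.mse_loss(z_hat, z_true)

def multiway_classification_loss(z_hat, z_true):
    # Contrastive objective: score the prediction against every candidate future
    # embedding in the batch; the matching pair is the correct class, the others
    # act as negatives, and the task becomes a B-way classification.
    logits = z_hat @ z_true.t()                          # (B, B) similarity scores
    targets = torch.arange(z_hat.size(0), device=z_hat.device)
    return F.cross_entropy(logits, targets)
```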
