Abstract

The rapid growth of tag-free user-generated videos on the Internet, recorded surgical videos, and surveillance footage has created a need for effective content-based video retrieval systems. Earlier methods for video representation are based on hand-crafted features, which perform poorly on video retrieval tasks. Deep learning methods have since demonstrated their effectiveness on both image and video tasks, but at the cost of massively labeled datasets. An economical alternative is therefore to use freely available unlabeled web videos for representation learning. Most recently developed methods in this direction solve a single pretext task using a 2D or 3D convolutional network. This paper instead designs and studies a 3D convolutional autoencoder (3D-CAE) for video representation learning, which requires no labels. Building on it, the paper proposes a new unsupervised video feature learning method based on joint learning of past and future frame prediction using the 3D-CAE with temporal contrastive learning. Experiments are conducted on the UCF-101 and HMDB-51 datasets, where the proposed approach achieves better retrieval performance than the state of the art. In an ablation study, an action recognition task is performed by fine-tuning the unsupervised pre-trained model, where it outperforms other methods, further confirming that our method learns the underlying features well. Such an unsupervised representation learning approach could also benefit the medical domain, where creating large labeled datasets is expensive.
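
The method the abstract describes lends itself to a compact sketch. The PyTorch code below is our illustration, not the authors' released implementation: every layer size, the module names (`Encoder3D`, `PastFuture3DCAE`), and the InfoNCE form of the temporal contrastive term are assumptions; the paper's exact architecture and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder3D(nn.Module):
    """3D conv encoder: maps a clip (B, C, T, H, W) to a latent volume."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Decoder3D(nn.Module):
    """Transposed 3D conv decoder: reconstructs a clip from the latent."""
    def __init__(self, out_ch=3, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(base, out_ch, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

class PastFuture3DCAE(nn.Module):
    """Shared encoder with two decoder heads: one reconstructs the past
    clip, the other predicts the future clip (joint pretext tasks)."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder3D()
        self.decode_past = Decoder3D()
        self.decode_future = Decoder3D()
    def forward(self, clip):
        z = self.encoder(clip)
        return z, self.decode_past(z), self.decode_future(z)

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE stand-in for the temporal contrastive term: clips
    from the same video are positives; other pairs in the batch, negatives."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature
    return F.cross_entropy(logits, torch.arange(z_a.size(0)))

# Joint objective on hypothetical data: two clips per video, plus
# past/future targets for the current clip.
model = PastFuture3DCAE()
clip_a = torch.randn(4, 3, 16, 64, 64)          # current clip
clip_b = torch.randn(4, 3, 16, 64, 64)          # another clip, same videos
past, future = torch.randn_like(clip_a), torch.randn_like(clip_a)

z_a, pred_past, pred_future = model(clip_a)
z_b, _, _ = model(clip_b)
pool = lambda z: z.mean(dim=(2, 3, 4))          # (B, C) clip embeddings
loss = (F.mse_loss(pred_past, past)
        + F.mse_loss(pred_future, future)
        + info_nce(pool(z_a), pool(z_b)))
```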

Highlights

  • Most of these videos are unlabeled or carry semantically meaningless tags, making video analysis and search difficult. Falsely tagged clips and misrepresented short videos are created to entice or mislead consumers by posing as fake news (Cao et al, 2020). Other sources, such as news agencies and surveillance networks, also produce large quantities of video recordings.

  • With the future frame prediction and past frame prediction tasks, the features learned on top of the 3D convolutional autoencoder (3D-CAE) show further improvement, which is reflected in retrieval accuracy

  • A novel unsupervised video representation learning technique is proposed, in which video features are learned via a joint pretext task of future frame and past frame prediction



Introduction

Since the inception of the Internet, the number of videos produced, uploaded, and downloaded from the World Wide Web has been expanding constantly. Deep learning has emerged as successful and powerful in computer vision tasks, including classification (Karpathy et al, 2014; Krizhevsky et al, 2012), segmentation (Shelhamer et al, 2017), gesture recognition (Jain et al, 2020a, 2020b), object detection (Ren et al, 2016), and retrieval (Babenko et al, 2014). The key to this success is the use of massively labeled data and effective deep learning models. For unsupervised learning of video representations, a great deal of recent work has been proposed, most of it based on self-supervised learning. Most of these methods are built around a single predefined pretext task (Benaim et al, 2020; Cho et al, 2021; Jing et al, 2018; Kim et al, 2019; Wang et al, 2020), which typically transforms the video and trains the network to predict the transformation, as sketched below.
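
For concreteness, a single-pretext-task method of this kind can be sketched as follows. The four playback transforms and the tiny classifier are hypothetical choices of ours for illustration, not taken from any of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_playback_transform(clip, label):
    """Apply one of four hypothetical temporal transforms to a clip of
    shape (C, T, H, W); the pretext task is to recover `label`."""
    if label == 0:                                   # original order
        return clip
    if label == 1:                                   # reversed playback
        return clip.flip(dims=[1])
    if label == 2:                                   # 2x speed, frames repeated
        return clip[:, ::2].repeat_interleave(2, dim=1)
    return clip.roll(shifts=clip.shape[1] // 2, dims=1)  # label 3: shifted start

# A tiny 3D conv classifier predicts which transform was applied.
net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 4),
)

clip = torch.randn(3, 16, 64, 64)                    # one training clip
label = torch.randint(0, 4, (1,))
x = apply_playback_transform(clip, int(label)).unsqueeze(0)  # add batch dim
loss = F.cross_entropy(net(x), label)
```

Training the network to recover the transformation forces it to model temporal structure, and the learned encoder can then be reused for retrieval or recognition.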

Related Work
Convolutional Autoencoder (2D-CAE)
Network Architecture
Multi-task Learning (MTL) based on 3D-CAE
Implementation Details
Comparison to State-of-the-art
Visualization
Ablation Study (Action Recognition)
Conclusion