Abstract

The rapid growth of tag-free user-generated videos on the Internet, recorded surgical videos, and surveillance footage has created a need for effective content-based video retrieval systems. Earlier methods for video representation are based on hand-crafted features, which perform poorly on video retrieval tasks. Deep learning methods have since demonstrated their effectiveness on both image and video tasks, but at the cost of massively labeled datasets. An economical alternative is therefore to use freely available unlabeled web videos for representation learning. Most recently developed methods in this direction solve a single pretext task using a 2D or 3D convolutional network. This paper instead designs and studies a 3D convolutional autoencoder (3D-CAE) for video representation learning, which requires no labels. Building on it, the paper proposes a new unsupervised video feature learning method based on joint learning of past and future frame prediction using the 3D-CAE with temporal contrastive learning. Experiments are conducted on the UCF-101 and HMDB-51 datasets, where the proposed approach achieves better retrieval performance than the state of the art. In an ablation study, an action recognition task is performed by fine-tuning the unsupervised pre-trained model, where it outperforms other methods, further confirming that our method learns the underlying features well. Such an unsupervised representation learning approach could also benefit the medical domain, where creating large labeled datasets is expensive.
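
The method the abstract describes lends itself to a compact sketch. The PyTorch code below is our illustration, not the authors' released implementation: every layer size, the module names (`Encoder3D`, `PastFuture3DCAE`), and the InfoNCE form of the temporal contrastive term are assumptions; the paper's exact architecture and loss may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder3D(nn.Module):
    """3D conv encoder: maps a clip (B, C, T, H, W) to a latent volume."""
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, base, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class Decoder3D(nn.Module):
    """Transposed 3D conv decoder: reconstructs a clip from the latent."""
    def __init__(self, out_ch=3, base=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(base * 2, base, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(base, out_ch, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

class PastFuture3DCAE(nn.Module):
    """Shared encoder with two decoder heads: one reconstructs the past
    clip, the other predicts the future clip (joint pretext tasks)."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder3D()
        self.decode_past = Decoder3D()
        self.decode_future = Decoder3D()
    def forward(self, clip):
        z = self.encoder(clip)
        return z, self.decode_past(z), self.decode_future(z)

def info_nce(z_a, z_b, temperature=0.1):
    """Generic InfoNCE stand-in for the temporal contrastive term: clips
    from the same video are positives; other pairs in the batch, negatives."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature
    return F.cross_entropy(logits, torch.arange(z_a.size(0)))

# Joint objective on hypothetical data: two clips per video, plus
# past/future targets for the current clip.
model = PastFuture3DCAE()
clip_a = torch.randn(4, 3, 16, 64, 64)          # current clip
clip_b = torch.randn(4, 3, 16, 64, 64)          # another clip, same videos
past, future = torch.randn_like(clip_a), torch.randn_like(clip_a)

z_a, pred_past, pred_future = model(clip_a)
z_b, _, _ = model(clip_b)
pool = lambda z: z.mean(dim=(2, 3, 4))          # (B, C) clip embeddings
loss = (F.mse_loss(pred_past, past)
        + F.mse_loss(pred_future, future)
        + info_nce(pool(z_a), pool(z_b)))
```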

Highlights

  • Most of these videos are unlabeled or carry semantically meaningless tags, making video analysis and search difficult. Falsely tagged clips and misrepresented short videos are created to entice or mislead consumers by posing as fake news (Cao et al, 2020). Other sources, such as news agencies and surveillance networks, also produce large quantities of video recordings.

  • With the future frame prediction and past frame prediction tasks, the features learned on top of the 3D convolutional autoencoder (3D-CAE) show further improvement, which is reflected in retrieval accuracy

  • A novel unsupervised video representation learning technique is proposed, in which video features are learned via a joint pretext task of future frame and past frame prediction



Introduction

Since the inception of the Internet, the number of videos produced, uploaded, and downloaded from the World Wide Web has been expanding constantly. Deep learning has emerged as successful and powerful in computer vision tasks, including classification (Karpathy et al, 2014; Krizhevsky et al, 2012), segmentation (Shelhamer et al, 2017), gesture recognition (Jain et al, 2020a, 2020b), object detection (Ren et al, 2016), and retrieval (Babenko et al, 2014). The key to this success is the use of massively labeled data and effective deep learning models. For unsupervised learning of video representations, a great deal of recent work has been proposed, most of it based on self-supervised learning. Most of these methods are built around a single predefined pretext task (Benaim et al, 2020; Cho et al, 2021; Jing et al, 2018; Kim et al, 2019; Wang et al, 2020), which typically transforms the video and trains the network to predict the transformation, as sketched below.
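
For concreteness, a single-pretext-task method of this kind can be sketched as follows. The four playback transforms and the tiny classifier are hypothetical choices of ours for illustration, not taken from any of the cited works.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_playback_transform(clip, label):
    """Apply one of four hypothetical temporal transforms to a clip of
    shape (C, T, H, W); the pretext task is to recover `label`."""
    if label == 0:                                   # original order
        return clip
    if label == 1:                                   # reversed playback
        return clip.flip(dims=[1])
    if label == 2:                                   # 2x speed, frames repeated
        return clip[:, ::2].repeat_interleave(2, dim=1)
    return clip.roll(shifts=clip.shape[1] // 2, dims=1)  # label 3: shifted start

# A tiny 3D conv classifier predicts which transform was applied.
net = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 4),
)

clip = torch.randn(3, 16, 64, 64)                    # one training clip
label = torch.randint(0, 4, (1,))
x = apply_playback_transform(clip, int(label)).unsqueeze(0)  # add batch dim
loss = F.cross_entropy(net(x), label)
```

Training the network to recover the transformation forces it to model temporal structure, and the learned encoder can then be reused for retrieval or recognition.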

Related Work
Convolutional Autoencoder (2D-CAE)
Network Architecture
Multi-task Learning (MTL) based on 3D-CAE
Implementation Details
Comparison to State-of-the-art
Visualization
Ablation Study (Action Recognition)
Conclusion