Repeat and learn: Self-supervised visual representations learning by Repeated Scene Localization

Hussein Altabrawee, Mohd Halim Mohd Noor

doi:10.1016/j.patcog.2024.110804

Abstract

Large labeled datasets are crucial for progress in video understanding. However, labeling is time-consuming, expensive, and tedious. To overcome this impediment, various pretext tasks exploit the temporal coherence of videos to learn visual representations in a self-supervised manner. However, these pretext tasks (order verification and sequence sorting) struggle with cyclic actions because of the label ambiguity problem. To overcome this limitation, we present a novel temporal pretext task for self-supervised learning of visual representations from unlabeled videos. Repeated Scene Localization (RSL) is a multi-class classification pretext task that changes the temporal order of the frames in a video by repeating a scene. The network is then trained to identify the modified videos, localize the repeated scene, and identify the unmodified original videos that contain no repeated scene. We evaluated the proposed pretext task on two benchmark datasets, UCF-101 and HMDB-51. The experimental results show that it achieves state-of-the-art results in action recognition and video retrieval. In action recognition, our S3D model achieves 88.15% and 56.86% on UCF-101 and HMDB-51, respectively, outperforming the current state of the art by 1.05% and 3.26%. Our R(2+1)D-Adjacent model achieves 83.52% and 54.50% on UCF-101 and HMDB-51, respectively, outperforming the single pretext tasks by 8.7% and 13.9%. In video retrieval, our R(2+1)D-Offset model outperforms the single pretext tasks by 4.68% and 1.1% in Top-1 accuracy on UCF-101 and HMDB-51, respectively. The source code and the trained models are publicly available at https://github.com/Hussein-A-Hassan/RSL-Pretext.
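
To make the idea of a repeated-scene pretext concrete, the following is a minimal sketch of how a training sample and its label could be built. It is not the authors' implementation: the abstract does not specify how the repeated scene is placed or how many classes are used, so the function name `make_rsl_sample`, the `num_segments` parameter, and the rule "segment k is overwritten by a copy of segment k-1" are all assumptions made for illustration. Label 0 stands for an unmodified clip, and a label k > 0 gives the assumed position of the repeated scene.

```python
import numpy as np


def make_rsl_sample(frames, num_segments=4, label=None, rng=None):
    """Build one (clip, label) pair for a repeated-scene pretext.

    Hypothetical construction, not taken from the paper:
      label 0      -> clip is left unmodified
      label k > 0  -> segment k is overwritten by a copy of segment k-1
    `frames` is indexed along its first axis (frame indices or image arrays).
    """
    rng = np.random.default_rng() if rng is None else rng
    frames = np.asarray(frames)
    seg_len = len(frames) // num_segments
    if seg_len == 0:
        raise ValueError("clip is shorter than num_segments")

    # Drop trailing frames so the clip divides evenly into segments.
    clip = frames[: seg_len * num_segments].copy()

    # Draw a class uniformly unless the caller fixes it.
    if label is None:
        label = int(rng.integers(0, num_segments))
    if not 0 <= label < num_segments:
        raise ValueError("label must be in [0, num_segments)")

    if label > 0:
        src = slice((label - 1) * seg_len, label * seg_len)
        dst = slice(label * seg_len, (label + 1) * seg_len)
        clip[dst] = clip[src]  # repeat the previous scene in place
    return clip, label


# Usage with frame indices standing in for frames.
clip, label = make_rsl_sample(np.arange(16), num_segments=4, label=2)
print(label, clip.tolist())
# 2 [0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7, 12, 13, 14, 15]
```

Under these assumptions a classifier with `num_segments` output classes would be trained with a standard cross-entropy loss on the returned label. Because the label depends only on where the repetition was inserted, it stays well defined even for cyclic actions, which is the ambiguity the abstract attributes to order-based pretext tasks.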

Similar Papers

Which One is Better? Self-supervised Temporal Coherence Learning for Skeleton Based Action Recognition
Bizhu Wu ... Linlin Shen
10 Oct 2022

Unsupervised Learning of Spatio-Temporal Representation with Multi-Task Learning for Video Retrieval
Vidit Kumar
24 May 2022

Volumetric white matter tract segmentation with nested self-supervised learning using sequential pretext tasks.
Qi Lu ... Chuyang Ye
Medical Image Analysis | VOL. 72
30 Apr 2021

Self-Supervised Spatiotemporal Representation Learning by Exploiting Video Continuity
Hanwen Liang ... Yang Wang
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 36
28 Jun 2022
