Abstract

Due to rapid technological advancements, the number of videos uploaded to the internet has grown exponentially. Most of these videos lack semantic tags, which makes indexing and retrieval challenging and calls for effective content-based analysis techniques. Meanwhile, supervised representation learning from large-scale labeled datasets has demonstrated great success in the image domain; however, creating such large-scale labeled datasets for videos is expensive and time consuming. To this end, we propose an unsupervised visual representation learning framework that learns spatiotemporal features by exploiting two pretext tasks, i.e., rotation prediction and future frame prediction. The quality of the learned features is evaluated on a nearest-neighbor task (video retrieval) using the UCF-101 dataset. The experimental results show that our method achieves competitive performance.
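
As a concrete illustration of the rotation-prediction pretext task named above, the following is a minimal PyTorch sketch: each clip is rotated by 0/90/180/270 degrees and a network is trained to predict which rotation was applied. The encoder, clip shape, and all hyperparameters here are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of a rotation-prediction pretext task (assumed setup,
# not the authors' exact model or training pipeline).
import torch
import torch.nn as nn

class Small3DEncoder(nn.Module):
    """Toy 3D-conv encoder standing in for the paper's backbone (assumption)."""
    def __init__(self, num_rotations=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global pooling -> (B, 16, 1, 1, 1)
        )
        self.classifier = nn.Linear(16, num_rotations)

    def forward(self, clip):           # clip: (B, C, T, H, W)
        feats = self.features(clip).flatten(1)
        return self.classifier(feats)  # logits over the 4 rotations

def make_rotation_batch(clip):
    """Rotate one clip by 0/90/180/270 degrees; the label is the rotation index."""
    rotations = [torch.rot90(clip, k, dims=(2, 3)) for k in range(4)]
    return torch.stack(rotations), torch.arange(4)

# Usage: one self-supervised training step on a random 16-frame clip.
clip = torch.randn(3, 16, 32, 32)            # (C, T, H, W), square frames
inputs, labels = make_rotation_batch(clip)   # (4, C, T, H, W), (4,)
model = Small3DEncoder()
loss = nn.CrossEntropyLoss()(model(inputs), labels)
loss.backward()
```

Because the pretext labels come for free from the rotation itself, no manual annotation is needed; the encoder trained this way can then be reused to embed clips for the nearest-neighbor retrieval evaluation.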
