Abstract

In the field of video representation learning, self-supervised methods have been applied effectively to pre-training on large unlabeled datasets before transfer to downstream tasks. Two basic approaches are typically used: pretext tasks and contrastive learning. First, in a pretext-task method, a user defines an auxiliary problem and uses it as a proxy objective for self-supervised learning. Second, contrastive learning predicts the relationship between instances under the assumption that features extracted by a model carry similar information across related instances. With the recent popularity of unsupervised learning, various self-supervised methods beyond these two are also used in video representation learning. Effective video representation learning is achieved by exploiting the multimodal nature of video, fusing audio-visual features with various deep learning techniques. In this paper, recent representative methods of self-supervised video representation learning are summarized and described. Additionally, we provide a brief overview of how to leverage multimodal (audio-visual) information, which is a key strength of video.
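To make the contrastive idea concrete, the following is a minimal NumPy sketch of an InfoNCE-style objective, the loss commonly used in contrastive representation learning. The function name, temperature value, and embedding sizes are illustrative, not taken from any specific method surveyed here: each anchor embedding is pulled toward its matching positive (e.g. another clip from the same video) and pushed away from the remaining in-batch instances, which serve as negatives.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Toy InfoNCE contrastive loss.

    Row i of `positives` is the positive pair for row i of `anchors`;
    every other row in the batch acts as a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Correct (anchor, positive) pairs lie on the diagonal.
    return -np.mean(np.diag(log_prob))

# Usage: matched positives should give a lower loss than random ones.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
loss_matched = info_nce_loss(a, a + 0.01 * rng.normal(size=(8, 16)))
loss_random = info_nce_loss(a, rng.normal(size=(8, 16)))
```

In a real video pipeline the embeddings would come from a backbone network applied to augmented clips (or to visual and audio streams of the same video, in the multimodal case), and the loss would be minimized by gradient descent; the mechanics of the objective are the same.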

