Abstract
In video representation learning, self-supervised methods have been applied effectively to pre-train models on large unlabeled datasets before transferring them to downstream tasks. The two basic approaches are pretext task methods and contrastive learning methods. In a pretext task method, a practitioner defines an auxiliary problem whose labels can be derived from the data itself and uses it as a proxy objective for self-supervised learning. In contrastive learning, a model learns representations under the assumption that features extracted from related instances (e.g., two clips of the same video) should be similar, while features of unrelated instances should differ. With the growing popularity of unsupervised learning, a variety of self-supervised methods beyond these two have been applied to video representation learning. Effective video representations can also be learned by exploiting the multimodal nature of video, fusing audio and visual features with various deep learning techniques. In this paper, we summarize and describe recent representative methods for self-supervised video representation learning. Additionally, we provide a brief overview of how to exploit multimodal (audio-visual) information, which is a key strength of video data.
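As a concrete illustration of the contrastive principle described above, the sketch below implements an InfoNCE-style loss (Oord et al., 2018) in PyTorch. The batch size, embedding dimension, and function name are illustrative assumptions, not a specific method from this survey; it is a minimal sketch assuming that two clips drawn from the same video form a positive pair and that other videos in the batch serve as negatives.

```python
# Minimal sketch of a contrastive (InfoNCE-style) loss for video clips.
# Assumption: `anchor` and `positive` are embeddings of two clips from the
# same videos, produced by some video encoder (e.g., a 3D CNN).
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """Pull embeddings of the same video together; push apart embeddings
    of different videos within the batch."""
    anchor = F.normalize(anchor, dim=1)      # (B, D) unit-length embeddings
    positive = F.normalize(positive, dim=1)  # (B, D)
    logits = anchor @ positive.t() / temperature  # (B, B) similarity matrix
    # Diagonal entries (i, i) are the positive pairs; every off-diagonal
    # entry is treated as a negative pair.
    targets = torch.arange(anchor.size(0))
    return F.cross_entropy(logits, targets)

# Usage with random stand-in embeddings (8 videos, 128-dim features):
anchor = torch.randn(8, 128)
positive = torch.randn(8, 128)  # augmented views of the same 8 videos
print(info_nce_loss(anchor, positive).item())
```

The temperature hyperparameter (0.07 here, a value commonly used in the contrastive learning literature) scales the similarity logits and controls how sharply the loss concentrates on hard negatives.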