Abstract
Natural untrimmed videos provide rich visual content for self-supervised learning. Yet most previous efforts to learn spatio-temporal representations rely on manually trimmed videos, such as the Kinetics dataset (Carreira and Zisserman 2017), resulting in limited diversity in visual patterns and limited performance gains. In this work, we aim to improve video representations by leveraging the rich information in natural untrimmed videos. For this purpose, we propose learning a hierarchy of temporal consistencies in videos: visual consistency, which holds for clip pairs separated by a short time span that tend to be visually similar, and topical consistency, which holds for clip pairs separated by a long time span that share similar topics. Specifically, we present a Hierarchical Consistency (HiCo++) learning framework, in which visually consistent pairs are encouraged to share the same feature representations through contrastive learning, while topically consistent pairs are coupled through a topical classifier that distinguishes whether they are topic-related, i.e., from the same untrimmed video. Additionally, we introduce a gradual sampling algorithm for the proposed hierarchical consistency learning and demonstrate its theoretical superiority. Empirically, we show that HiCo++ not only generates stronger representations on untrimmed videos, but also improves representation quality when applied to trimmed videos. This contrasts with standard contrastive learning, which fails to learn powerful representations from untrimmed videos. Source code will be made available here.
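The two consistency objectives described above can be illustrated with a minimal NumPy sketch. It is not the HiCo++ implementation: the abstract does not give the loss formulas, so the sketch assumes an InfoNCE-style contrastive loss for visually consistent pairs and a linear classifier with binary cross-entropy for topically consistent pairs. The function names, the `temperature` value, and the classifier parameters `w` and `b` are all hypothetical, and the gradual sampling algorithm is not modeled.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def visual_consistency_loss(z_a, z_b, temperature=0.1):
    """Assumed InfoNCE-style loss over a batch of short-gap clip pairs.

    Row i of z_a and row i of z_b are embeddings of two clips taken a short
    time apart in the same video; all other rows in the batch act as negatives.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal.
    return -np.mean(np.diag(log_prob))


def topical_consistency_loss(z_a, z_b, same_video, w, b=0.0):
    """Assumed topical classifier: a linear layer on concatenated pair features.

    same_video[i] is 1.0 if the two clips come from the same untrimmed video
    and 0.0 otherwise. The loss is binary cross-entropy computed from logits.
    """
    pair = np.concatenate([z_a, z_b], axis=1)
    logits = pair @ w + b
    # Numerically stable form of BCE-with-logits.
    loss = (np.maximum(logits, 0.0)
            - logits * same_video
            + np.log1p(np.exp(-np.abs(logits))))
    return loss.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_short_a = rng.normal(size=(8, 16))
    z_short_b = z_short_a + 0.05 * rng.normal(size=(8, 16))  # near-duplicates
    z_long_a = rng.normal(size=(8, 16))
    z_long_b = rng.normal(size=(8, 16))
    labels = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=float)
    w = np.zeros(32)  # untrained hypothetical classifier weights

    total = (visual_consistency_loss(z_short_a, z_short_b)
             + topical_consistency_loss(z_long_a, z_long_b, labels, w))
    print(f"combined loss: {total:.4f}")
```

In this sketch the two losses are simply summed with equal weight; how the paper balances them is not stated in the abstract, so the weighting is also an assumption.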
More From: IEEE transactions on pattern analysis and machine intelligence