Abstract

Given an egocentric video, action temporal segmentation aims to divide the video into basic units, each depicting a single action. Because the camera is constantly moving, important objects may disappear across consecutive frames, causing abrupt changes in the visual content. Recent works fail to handle this condition without manually annotating abundant frames. In this study, we propose a temporal-aware clustering method for egocentric action temporal segmentation: a self-supervised temporal autoencoder (SSTAE). Instead of directly learning visual features, the SSTAE encodes the preceding target frames and predicts subsequent frames in the temporal-relationship domain, which accounts for local temporal consistency. The algorithm is guided by reconstruction and prediction losses. Consequently, local temporal context is naturally integrated into the feature representation, on which a clustering step is then performed. Experiments on three egocentric datasets demonstrate that our approach outperforms state-of-the-art methods by 7.57% in clustering Accuracy (ACC), 8.17% in Normalized Mutual Information (NMI), and 8.6% in Adjusted Rand Index (ARI).
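The joint objective described above can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, the linear encoder/decoder weights, and the toy frame features below are all hypothetical stand-ins, used only to show how a reconstruction loss on the target frame and a prediction loss on subsequent frames combine into one training signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): D-dim frame features,
# P preceding frames in the input window, K subsequent frames to
# predict, and an H-dim latent embedding.
D, P, K, H = 64, 4, 2, 16

# Toy per-frame features standing in for visual descriptors.
frames = rng.standard_normal((P + 1 + K, D))
preceding, target, subsequent = frames[:P], frames[P], frames[P + 1:]

# Linear maps as stand-ins for the encoder, the decoder that
# reconstructs the target frame, and the head that predicts
# the K subsequent frames.
W_enc = rng.standard_normal((P * D, H)) * 0.1
W_dec = rng.standard_normal((H, D)) * 0.1
W_pred = rng.standard_normal((H, K * D)) * 0.1

z = preceding.reshape(-1) @ W_enc            # temporal-aware embedding
recon = z @ W_dec                            # reconstructed target frame
pred = (z @ W_pred).reshape(K, D)            # predicted subsequent frames

recon_loss = np.mean((recon - target) ** 2)
pred_loss = np.mean((pred - subsequent) ** 2)
total_loss = recon_loss + pred_loss          # joint self-supervised objective
```

After training with such an objective, the embedding `z` carries local temporal context, and a standard clustering algorithm (e.g. k-means) can be run on the per-frame embeddings to obtain action segments.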

