Abstract

Given an egocentric video, action temporal segmentation aims to temporally segment the video into basic units, each depicting a single action. Because the camera is constantly moving, important objects may disappear over consecutive frames, causing abrupt changes in the visual content. Recent works fail to handle this condition without abundant manually annotated frames. In this study, we propose a temporal-aware clustering method for egocentric action temporal segmentation: a self-supervised temporal autoencoder (SSTAE). Instead of directly learning visual features, the SSTAE encodes the preceding and target frames and predicts subsequent frames in the temporal relationship domain, thereby taking local temporal consistency into account. The proposed algorithm is guided by reconstruction and prediction losses. Consequently, local temporal context is naturally integrated into the feature representation, on which a clustering step is then performed. Experiments on three egocentric datasets demonstrate that our approach outperforms state-of-the-art methods by 7.57% in clustering Accuracy (ACC), 8.17% in Normalized Mutual Information (NMI), and 8.6% in Adjusted Rand Index (ARI).
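The sketch below illustrates the general idea described above; it is not the authors' implementation. It assumes frame-level features have already been extracted (the 2048-dimensional feature size, hidden size, prediction horizon, loss weighting, and cluster count are all illustrative assumptions): an encoder produces a per-frame embedding, a decoder reconstructs the input feature, and a predictor estimates the features of subsequent frames, so the training objective combines a reconstruction loss and a prediction loss.

```python
# Minimal sketch (assumed, not the paper's code) of a temporal autoencoder
# trained with reconstruction + prediction losses on precomputed frame features.
import torch
import torch.nn as nn


class TemporalAutoencoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=256, horizon=3):
        super().__init__()
        self.horizon = horizon  # number of subsequent frames to predict (assumed)
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Decoder reconstructs the current frame feature from the embedding.
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Predictor estimates the features of the next `horizon` frames.
        self.predictor = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim * horizon),
        )

    def forward(self, x):
        z = self.encoder(x)            # (B, hidden_dim) frame embedding
        recon = self.decoder(z)        # (B, feat_dim) reconstruction
        pred = self.predictor(z)       # (B, feat_dim * horizon) future features
        return z, recon, pred.view(x.size(0), self.horizon, -1)


def train_step(model, optimizer, cur, future, lam=1.0):
    """One step on current-frame features `cur` (B, D) and the features of
    the next `horizon` frames `future` (B, horizon, D)."""
    z, recon, pred = model(cur)
    loss_recon = nn.functional.mse_loss(recon, cur)   # reconstruction loss
    loss_pred = nn.functional.mse_loss(pred, future)  # prediction loss
    loss = loss_recon + lam * loss_pred               # lam is an assumed weight
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After training, the learned per-frame embeddings would be grouped by a standard clustering algorithm (e.g. k-means), and runs of frames sharing a cluster label form the temporal action segments; the choice of k-means here is an assumption for illustration.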
