Abstract

Recent years have witnessed significant advances in deep video action recognition. However, the performance of deep learning-based video action recognition methods degrades when labeled samples are scarce or unavailable. Using unlabeled video data to generate clustering labels is therefore essential for few-shot and zero-shot learning. In this paper, we propose a novel deep video action clustering network that learns the similarity relationships among unlabeled video samples and generates a clustering label for each sample. Specifically, the proposed method learns spatio-temporal features and subspace representations simultaneously under a jointly optimized framework. It consists of a 3D U-Net self-representation generator, a video-clip reconstruction discriminator, and a confidence-based feedback mechanism. The 3D U-Net self-representation generator learns the spatio-temporal features of the video clips and produces a subspace representation matrix. A similarity graph is then constructed from this matrix, and the clustering result is obtained from it. During learning, the confidence-based feedback mechanism feeds the high-confidence labels of a subset of samples back to further guide subspace structure learning, so that an optimal result can be reached. During training, the video-clip reconstruction discriminator evaluates the reconstructed video clips, which helps the generator capture discriminative spatio-temporal features. Experimental results on a benchmark video dataset demonstrate the effectiveness of the proposed method.
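To make the graph-construction step concrete, the following is a minimal sketch of the standard subspace-clustering recipe: given a learned self-representation matrix C (so that the feature matrix X is approximately reconstructed as XC), a symmetric affinity graph is built from |C| and partitioned with spectral clustering. The function name and the post-processing choices are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_from_self_representation(C, n_clusters):
    # Symmetrize the self-representation coefficients so the affinity
    # between clips i and j reflects both C_ij and C_ji.
    W = 0.5 * (np.abs(C) + np.abs(C).T)
    np.fill_diagonal(W, 0.0)  # drop self-loops from the similarity graph
    # Spectral clustering on the precomputed affinity yields the
    # clustering label for each video sample.
    return SpectralClustering(
        n_clusters=n_clusters,
        affinity="precomputed",
        random_state=0,
    ).fit_predict(W)

# Toy usage: a random 10x10 self-representation matrix for 10 clips.
rng = np.random.default_rng(0)
labels = cluster_from_self_representation(rng.standard_normal((10, 10)), 2)
print(labels)  # cluster index in {0, 1} for each clip
```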
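The confidence-based feedback step can be sketched in the same spirit. The abstract does not specify how confidence is measured; one plausible instantiation, shown here purely as an assumption, scores each sample by the relative margin between its nearest and second-nearest cluster centroid in feature space and feeds only high-margin pseudo-labels back as supervision for the next training round.

```python
import numpy as np

def high_confidence_pseudo_labels(features, labels, margin=0.2):
    # Hypothetical confidence criterion (the paper does not spell one out):
    # a clip's pseudo-label is trusted when its nearest cluster centroid is
    # clearly closer than the second-nearest one.
    ks = np.unique(labels)
    centroids = np.stack([features[labels == k].mean(axis=0) for k in ks])
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    d1 = np.take_along_axis(d, order[:, :1], axis=1).ravel()   # nearest
    d2 = np.take_along_axis(d, order[:, 1:2], axis=1).ravel()  # runner-up
    conf = (d2 - d1) / (d2 + 1e-8)  # relative margin in [0, 1)
    mask = conf > margin
    # Only these (sample index, pseudo-label) pairs would be fed back to
    # guide the next round of subspace structure learning.
    return np.flatnonzero(mask), ks[order[mask, 0]]
```

In such a scheme the returned pseudo-labels would enter the joint objective as an extra supervised term, with the margin threshold left as a free hyperparameter.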
