Abstract

Some recent studies have focused on deep-learning-based semi-supervised learning for action recognition. However, their training is difficult to scale up because their input is RGB frames, which are costly to obtain in terms of computation and storage. In this paper, we propose a semi-supervised action recognition method that scales up training easily by using features stored in compressed videos. Our method extracts multiple types of input features directly from compressed videos without any decoding and generates artificial labels for unlabeled videos by ensembling the predictions from these features. In addition to standard supervised training on labeled videos, our models are trained to predict these artificial labels from strongly augmented features of unlabeled compressed videos. We show that our method is more efficient than conventional semi-supervised learning methods that use RGB frames and achieves better classification performance on several widely used datasets.
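The pseudo-labeling step the abstract describes, ensembling predictions from multiple compressed-domain feature types to label unlabeled videos, can be sketched as follows. This is a minimal NumPy illustration under assumptions: the feature types (I-frames, motion vectors, residuals) and the confidence threshold are illustrative, not the paper's exact configuration.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_pseudo_labels(logits_per_modality, threshold=0.95):
    """Average predictions over modalities; keep confident ones as labels.

    logits_per_modality: list of (batch, classes) arrays, one per
    compressed-domain feature type (e.g. I-frames, motion vectors,
    residuals) -- these modality names are assumptions for illustration.
    Returns (labels, mask): hard artificial labels and a boolean mask
    selecting clips whose ensembled confidence exceeds the threshold;
    only masked clips would contribute to the unlabeled-data loss.
    """
    probs = np.mean([softmax(l) for l in logits_per_modality], axis=0)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = conf >= threshold
    return labels, mask

# Example: predictions from three feature types for 4 unlabeled clips, 5 classes.
rng = np.random.default_rng(0)
logits = [rng.normal(size=(4, 5)) for _ in range(3)]
labels, mask = ensemble_pseudo_labels(logits, threshold=0.5)
```

In a FixMatch-style setup, the model would then be trained so that its predictions on strongly augmented versions of the masked clips match `labels`.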
