Abstract

In video action recognition, motion is a crucial cue and is usually represented by optical flow. However, optical flow is computationally expensive to obtain, which makes it the efficiency bottleneck of traditional action recognition pipelines. In this article, we propose MV2Flow, a network that learns motion representations efficiently from signals in the compressed video domain. To train the network, we define three losses. First, classical TV-L1 flow serves as a proxy ground truth to guide the learning. Second, an unsupervised image reconstruction loss further refines the predicted flow. Third, targeting the downstream task of action recognition, these two losses are combined with a motion content loss. To evaluate our approach, we conduct extensive experiments on two benchmark datasets, UCF-101 and HMDB-51. The motion representations generated by MV2Flow achieve action recognition accuracy comparable to TV-L1 flow while being computed over 200× faster. Combining MV2Flow with a 2D-CNN backbone achieves state-of-the-art performance in the compressed domain; with a 3D-CNN backbone, it also matches the accuracy of decoded-domain methods at a higher inference speed.
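
The abstract describes the three training losses only at a high level. The sketch below illustrates one plausible way such a combined objective could be assembled in PyTorch; the function names (`warp`, `combined_loss`, `feat_fn`), the loss weights, and the exact form of each term (endpoint error against TV-L1 flow, a photometric warping error, and feature matching through a recognition network) are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N, C, H, W) using flow (N, 2, H, W) in pixels."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=img.dtype, device=img.device),
        torch.arange(w, dtype=img.dtype, device=img.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys)).unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                        # where to sample each pixel
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)        # (N, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def combined_loss(pred_flow, tvl1_flow, frame_t, frame_t1, feat_fn,
                  w_proxy=1.0, w_recon=1.0, w_content=1.0):
    """Hypothetical combination of the three losses named in the abstract."""
    # (1) Proxy ground-truth loss: endpoint error against TV-L1 flow.
    l_proxy = torch.norm(pred_flow - tvl1_flow, dim=1).mean()
    # (2) Unsupervised reconstruction loss: warp frame t+1 back with the
    #     predicted flow and penalize the photometric error to frame t.
    l_recon = (warp(frame_t1, pred_flow) - frame_t).abs().mean()
    # (3) Motion content loss: match the features that a recognition
    #     network (feat_fn) extracts from predicted vs. reference flow.
    l_content = F.mse_loss(feat_fn(pred_flow), feat_fn(tvl1_flow))
    return w_proxy * l_proxy + w_recon * l_recon + w_content * l_content
```

In this sketch, `feat_fn` would be something like the frozen early layers of a flow-stream action classifier, and the `w_*` weights are hypothetical balancing coefficients that the paper would tune per dataset.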
