Abstract

Complex action segmentation aims to detect which actions occur, and when, at a fine-grained level in long videos. Although videos are often stored in a compressed format (e.g., MPEG-4), most existing approaches directly model raw RGB videos: when only compressed videos are accessible, they must first decode them, which is very time-consuming. In this paper, by explicitly leveraging the ‘compressed’ characteristic of compressed videos, we are the first to address the challenging task of complex action segmentation in compressed videos. To extract meaningful representations for this task, we introduce GOP-Level Compressed features (Golec), which can be obtained directly from compressed videos without video decompression. Importantly, by taking GOPs (groups of pictures) as the atomic units of actions, our Golec representation is intrinsically suitable for fine-grained action segmentation. Moreover, because the motion vectors that our Golec representation uses to capture temporal context are coarser than optical flow computed from raw frames, we propose a new Bi-path knowledge distillation strategy to compensate. Extensive experiments show the effectiveness of both the Golec representation and the Bi-path strategy. Importantly, our proposed model for complex action detection not only runs 5.2 times faster than state-of-the-art alternatives that use raw videos but also achieves significantly better results.
