Abstract

In this paper, we propose a novel mid-level feature representation for the recognition of actions in videos. This descriptor proves to possess relevant discriminative power when used in a generic action recognition pipeline. It is well known that mid-level feature descriptors learned using class-oriented information are potentially more distinctive than low-level features extracted in a bottom-up, unsupervised fashion. In this regard, we introduce the notion of concepts, a mid-level feature representation capable of tracking the dynamics of motion-salient regions over consecutive frames in a video sequence. Our representation is based on the idea of region correspondence across consecutive frames, and we use an unsupervised iterative bipartite graph matching algorithm to extract representative visual concepts from action videos. The progression of such salient regions, which are also consistent in appearance, is then represented as chain graphs. Finally, we adopt an intuitive time-series pooling strategy to extract discriminative features from the chains, which are then used in a dictionary learning based classification framework. Given the high variability of the movements of different human body parts across actions, the extracted conceptual descriptors are shown to capture distinct dynamic characteristics by exclusively encoding the interacting parts associated with the chains. Furthermore, we use these descriptors in a semi-supervised, clustering-based zero-shot action recognition setting, achieving good performance without resorting to costly attribute annotation. We validate the proposed framework on four public datasets, namely KTH, UCF-101, HOHA and HMDB-51, reporting classification accuracies that improve upon (and in some cases are comparable to) the state of the art.
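
To make the pipeline concrete, the sketch below illustrates the core frame-to-frame correspondence step in Python. It is not the authors' implementation: region extraction, the appearance descriptor, the chain-linking logic, and the cost threshold tau are all simplifying assumptions introduced here, and the pooling shown is a plain temporal mean/max standing in for the paper's time-series pooling strategy.

# Minimal sketch (assumptions noted above): match motion-salient regions
# across consecutive frames via bipartite matching, link the matches into
# chains, then pool each chain's descriptors over time.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_regions(feats_t, feats_t1, tau=0.5):
    """Bipartite matching between region descriptors of frames t and t+1.
    Returns (i, j) index pairs whose appearance cost is below tau."""
    cost = cdist(feats_t, feats_t1, metric="cosine")  # appearance dissimilarity
    rows, cols = linear_sum_assignment(cost)          # optimal one-to-one assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < tau]

def build_chains(per_frame_feats, tau=0.5):
    """Link frame-to-frame matches into chains of region descriptors."""
    chains = [[f] for f in per_frame_feats[0]]    # one chain per region in frame 0
    heads = {i: c for i, c in enumerate(chains)}  # region index -> its chain
    for t in range(len(per_frame_feats) - 1):
        pairs = match_regions(per_frame_feats[t], per_frame_feats[t + 1], tau)
        new_heads = {}
        for i, j in pairs:                        # extend matched chains
            if i in heads:
                heads[i].append(per_frame_feats[t + 1][j])
                new_heads[j] = heads[i]
        for j, f in enumerate(per_frame_feats[t + 1]):
            if j not in new_heads:                # unmatched regions start new chains
                chain = [f]
                chains.append(chain)
                new_heads[j] = chain
        heads = new_heads
    return chains

def pool_chain(chain):
    """Simple time-series pooling: concatenate temporal mean and max."""
    arr = np.stack(chain)
    return np.concatenate([arr.mean(0), arr.max(0)])

# Toy usage: 5 frames, each with 4 random 32-D region descriptors.
# A loose tau is used so the random regions actually link into chains.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(4, 32)) for _ in range(5)]
descriptors = [pool_chain(c) for c in build_chains(frames, tau=1.5)]
print(len(descriptors), descriptors[0].shape)  # -> 4 chains, each a 64-D descriptor

In a full pipeline, the pooled chain descriptors would then feed the dictionary learning based classifier described above; here they are printed only to show the shapes involved.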
