Abstract

How do humans recognize an action or an interaction in the real world? Because viewing perspectives vary widely, even a familiar activity can be difficult to identify when it is observed from an uncommon viewpoint. We argue that discriminative spatiotemporal information remains an essential cue for human action recognition. Most existing skeleton-based methods learn representations under hand-crafted criteria, which require large amounts of labeled data and considerable human effort. This article introduces adaptive skeleton-based neural networks that learn optimal spatiotemporal representations automatically in a data-driven manner. First, an adaptive skeleton representation transformation (ASRT) method is proposed to model view-variant data without hand-crafted criteria. Next, powered by a novel attentional LSTM encapsulated with 3-D convolution (C3D-LSTM), the proposed model effectively enables its memory blocks to learn both short-term frame dependencies and long-term relations, and can therefore understand long or complex actions more accurately. Furthermore, a data-enhancement-driven end-to-end training scheme is proposed to train the key parameters with fewer training samples. Enhanced by the learned high-performance spatiotemporal representation, the proposed model achieves state-of-the-art performance on five challenging benchmarks.
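To make the C3D-LSTM idea concrete, the sketch below implements a recurrent cell whose gates are computed with 3-D convolutions, so each recurrent step sees a short clip of frames (short-term dependency) while the recurrent state carries long-term relations; a simple channel-wise soft attention reweights the hidden state. This is a minimal illustration under assumed shapes, kernel sizes, and attention form, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of a 3-D-convolutional attentional LSTM cell in the
# spirit of C3D-LSTM. All design choices here (gate layout, kernel size,
# channel-wise attention) are assumptions for illustration only.
import torch
import torch.nn as nn

class Conv3dLSTMCell(nn.Module):
    def __init__(self, in_channels: int, hidden_channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # One Conv3d produces all four gates (input, forget, cell, output) at once,
        # so every gate sees a short spatiotemporal neighborhood of the clip.
        self.gates = nn.Conv3d(in_channels + hidden_channels,
                               4 * hidden_channels,
                               kernel_size, padding=padding)
        # Assumed soft attention: channel-wise reweighting of the hidden state.
        self.attn = nn.Sequential(nn.AdaptiveAvgPool3d(1),
                                  nn.Conv3d(hidden_channels, hidden_channels, 1),
                                  nn.Sigmoid())

    def forward(self, x, state):
        # x: (B, C_in, T_clip, H, W) -- one short clip of skeleton feature maps
        h, c = state  # each: (B, C_hidden, T_clip, H, W)
        i, f, g, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        h = h * self.attn(h)  # attention emphasizes informative channels
        return h, (h, c)

# Usage: unroll the cell over a sequence of short clips; the recurrent state
# accumulates long-term relations across clips.
cell = Conv3dLSTMCell(in_channels=3, hidden_channels=16)
B, T_clip, H, W = 2, 4, 8, 8
h = torch.zeros(B, 16, T_clip, H, W)
c = torch.zeros_like(h)
for _ in range(5):  # five recurrent steps, one clip per step
    clip = torch.randn(B, 3, T_clip, H, W)
    out, (h, c) = cell(clip, (h, c))
```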
