This paper focuses on the problem of online action detection for interactive systems, with a special emphasis on earliness. Online Action Detection (OAD) refers to the challenging task of recognizing gestures in untrimmed, streaming videos where the actions occur in unpredictable orders and durations. To address these challenges, we present a skeleton-based system for OAD incorporating a decision mechanism to accurately detect ongoing gestures. This allows us to provide instance-level output, achieving a high level of stream understanding. This mechanism relies on a novel Connectionist Temporal Classification (CTC) loss design that restricts the path possibilities according to the action boundaries. We also present a mechanism to tune the trade-off between accuracy and earliness according to the needs of the interactive system using a weighted label prior. This system includes a 3D CNN network, referred to as DOLT-C3D, exploiting the spatial–temporal information provided by the euclidean skeleton representation. We extensively evaluate our approach on eight publicly available datasets, demonstrating its superior performance compared to state-of-the-art methods in terms of both accuracy and earliness. We also successfully applied our approach to early 2D gestures detection. Furthermore, our system shows real-time performance, making it a suitable choice for interactive systems.
Read full abstract7-days of FREE Audio papers, translation & more with Prime
7-days of FREE Prime access