In open-world scenarios, analyzing action events from multiple viewpoints is crucial for a holistic understanding, mirroring how human perception works. However, synthesizing information across viewpoints is challenging, because multi-view data can induce inter-view regularization that complicates learning. This paper is the first to study Online Action Detection (OAD) from a multi-view perspective, underscoring the value of cross-observation in enriching view-level information. To exploit the spatiotemporal dynamics inherent in multi-view video sequences, an Annealing Temporal–Spatial Contrastive Learning (ATSCL) framework is proposed, consisting of Annealing Temporal Contrastive Learning (ATCL) and Spatial Contrastive Learning (SCL) and designed for compatibility with RNN-based models. ATCL employs an annealing temporal loss, driven by a temporal annealing sampling mechanism, to uncover intrinsic video structure. Concurrently, SCL uses a spatial loss to draw representations from different viewpoints closer together, mitigating the regularization effects. ATSCL frees multi-view OAD training from the strict requirement of synchronized training videos, so that OAD tasks can be executed asynchronously. Experiments show that, after integrating ATSCL, RNN-based models improve on average by 5.92% on the DAHLIA dataset, 3.36% on the IKEA ASM dataset, 2.92% on the BREAKFAST dataset, and 0.6% (mAP) on the THUMOS’14 dataset, demonstrating the efficacy of ATSCL across different RNN structures.
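To make the idea of a spatial loss that pulls representations from different viewpoints together more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the paper's SCL formulation, which the abstract does not specify: the function names, the cosine similarity, and the `temperature` parameter are all illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero feature vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def spatial_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact SCL).

    anchor:    feature of an action observed from one viewpoint
    positive:  feature of the same action observed from another viewpoint
    negatives: features of different actions (from any viewpoint)
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    # The loss shrinks as the anchor aligns with the positive
    # relative to the negatives.
    return -math.log(pos / (pos + neg))


# Hypothetical 2-D features, for illustration only.
anchor = [1.0, 0.0]
negatives = [[0.0, 1.0], [-1.0, 0.0]]
loss_aligned = spatial_contrastive_loss(anchor, [0.9, 0.1], negatives)
loss_misaligned = spatial_contrastive_loss(anchor, [0.0, 1.0], negatives)
```

Under these assumptions, `loss_aligned` is smaller than `loss_misaligned`: minimizing such a loss rewards features of the same action that agree across viewpoints. The loss only compares feature pairs, so nothing in this sketch depends on frame-level synchronization between views.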