Spatio-temporal Action Localization Research Articles

Purpose: The purpose of this paper is to provide a fast and accurate network for spatiotemporal action localization in videos. It detects human actions in both time and space simultaneously and in real time, which makes it applicable to real-world scenarios such as safety monitoring and collaborative assembly.

Design/methodology/approach: This paper designs an end-to-end deep learning network called collaborator only watch once (COWO). COWO recognizes ongoing human activities in real time with enhanced accuracy. COWO inherits the architecture of you only watch once (YOWO), known to be the best-performing network for online action localization to date, but with three major structural modifications. First, COWO enhances intraclass compactness and enlarges interclass separability at the feature level: a new correlation channel fusion and attention mechanism is designed based on the Pearson correlation coefficient and, accordingly, a correction loss function is designed that minimizes the same-class distance and enhances intraclass compactness. Second, a probabilistic K-means clustering technique is used to select the initial seed points, the idea being that the initial distance between cluster centers should be as large as possible. Third, the CIoU regression loss function is applied instead of the Smooth L1 loss function to help the model converge stably.

Findings: COWO outperforms the original YOWO with frame mAP improvements of 3% and 2.1% at a speed of 35.12 fps. Compared with two-stream, T-CNN and C3D, the improvement is about 5% and 14.5% when applied to the J-HMDB-21, UCF101-24 and AGOT data sets.

Originality/value: COWO offers more flexibility for assembly scenarios because it perceives spatiotemporal human actions in real time. It contributes to many real-world scenarios such as safety monitoring and collaborative assembly.
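
The abstract above names the CIoU regression loss without spelling it out. As a rough illustration only, the sketch below computes the commonly published CIoU form (1 - IoU, plus a normalized center-distance term, plus an aspect-ratio term) for a single pair of boxes. The function name `ciou_loss`, the `(x1, y1, x2, y2)` box layout and the `eps` stabilizer are assumptions made here for illustration; this is not the COWO authors' code, and their implementation may differ in detail.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for one predicted box and one ground-truth box.

    Both boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed layout).
    """
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between the two box centers.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(x2, gx2) - min(x1, gx1)
    enc_h = max(y2, gy2) - min(y1, gy1)
    c2 = enc_w ** 2 + enc_h ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Unlike Smooth L1 on raw coordinates, this loss still gives a useful signal when the boxes do not overlap, because the center-distance term keeps growing as the boxes move apart.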

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this work, we propose a new progressive cross-stream cooperation (PCSC) framework that improves all three tasks. The basic idea is to use both the spatial region proposals (resp., temporal segment proposals) and the features from one stream (i.e., the Flow/RGB stream) to help the other stream (i.e., the RGB/Flow stream) iteratively generate better bounding boxes in the spatial domain (resp., temporal segments in the temporal domain). In this way, actions can be localized more accurately both spatially and temporally, and the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples, which helps learn better action detection or segment detection models. Second, to learn better representations, we propose a new message passing approach that passes information from one stream to the other, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
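
The first step described above, combining region proposals from the RGB and Flow streams into one larger set, can be pictured with a very small sketch. Everything below is hypothetical: the function names `box_iou` and `pool_proposals`, the `(x1, y1, x2, y2)` box layout, and in particular the near-duplicate filter with its `dup_thresh` value are assumptions added for illustration, since the abstract does not say how overlapping proposals are handled.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pool_proposals(rgb_boxes, flow_boxes, dup_thresh=0.9):
    """Merge proposals from both streams into one larger proposal set.

    A flow-stream box is skipped only when it nearly coincides with a box
    already kept (IoU >= dup_thresh). This filter is an assumption of the
    sketch, not something stated in the abstract.
    """
    pooled = list(rgb_boxes)
    for box in flow_boxes:
        if all(box_iou(box, kept) < dup_thresh for kept in pooled):
            pooled.append(box)
    return pooled
```

In the framework as described, a pooled set like this would then feed the next round of detector training, with the two streams alternating roles across iterations.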

Articles published on Spatio-temporal Action Localization

Real-time spatiotemporal action localization algorithm using improved CNNs architecture

You watch once more: a more effective CNN architecture for video spatio-temporal action localization

Improved Algorithm of Spatio-Temporal Action Localization Based on YOWO

Com-STAL: Compositional Spatio-Temporal Action Localization

Spatio-temporal human action localization in indoor surveillances

The joint detection and classification model for spatiotemporal action localization of primates in a group

COWO: towards real-time spatiotemporal action localization in videos

GLNet: Global Local Network for Weakly Supervised Action Localization

Progressive Cross-Stream Cooperation in Spatial and Temporal Domain for Action Localization

Learning motion representation for real-time spatio-temporal action localization

Learning a strong detector for action localization in videos

ML-HDP: A Hierarchical Bayesian Nonparametric Model for Recognizing Human Actions in Video

Segment-Tube: Spatio-Temporal Action Localization in Untrimmed Videos with Per-Frame Segmentation.

Spatio-temporal action localization and detection for human action recognition in big dataset
