Action recognition aims to classify a video into a category by aggregating and summarizing its temporal and spatial information. Existing methods have achieved remarkable performance on standard datasets. However, accurately recognizing actions in occluded scenarios remains a major challenge in practical applications and is barely explored. Although previous augmentation methods address occlusion by introducing additional modalities, temporally continuous occlusion in video has rarely been considered. To tackle this issue, this paper takes both sample augmentation and feature representation into consideration. Specifically, we design a Dynamic Temporal-aware Erasing (DTE) augmentation strategy to enrich occlusion samples. DTE builds on an explicit analysis conditioned on the temporal dimension and the actors' trajectories, which allows an arbitrary spatial augmentation setting to carry temporal relations. By obtaining the actors' motion trajectories, DTE makes the augmented samples temporally consistent, enhancing robustness against occlusions. Besides, we revisit the necessity of diverse backgrounds and propose Dynamic and Static Mutual Fitting (DSMF) to optimize the action recognition model. DSMF uses background cues as an auxiliary signal that mutually fits with actor features, learning a smooth, discriminative representation with global temporal consistency. Extensive experiments on standard benchmarks show that the proposed DSMF achieves competitive performance against strong competitors.
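The core idea behind temporal-aware erasing can be illustrated with a minimal sketch. The function below is a hypothetical simplification (not the paper's implementation): it erases a square patch whose position follows an assumed actor trajectory across frames, here approximated by linear interpolation between two points, so the synthetic occlusion stays temporally consistent instead of jumping randomly per frame.

```python
import numpy as np

def dynamic_temporal_erase(video, start_xy, end_xy, patch=16):
    """Erase a patch that tracks a (hypothetical) actor trajectory.

    video:    array of shape (T, H, W, C)
    start_xy: patch center (x, y) in the first frame
    end_xy:   patch center (x, y) in the last frame
    """
    t_len, h, w, _ = video.shape
    out = video.copy()
    for t in range(t_len):
        # Linear interpolation stands in for a real tracked trajectory.
        alpha = t / max(t_len - 1, 1)
        cx = int((1 - alpha) * start_xy[0] + alpha * end_xy[0])
        cy = int((1 - alpha) * start_xy[1] + alpha * end_xy[1])
        # Clip the patch to the frame boundaries.
        x0, x1 = max(cx - patch // 2, 0), min(cx + patch // 2, w)
        y0, y1 = max(cy - patch // 2, 0), min(cy + patch // 2, h)
        out[t, y0:y1, x0:x1] = 0  # zero-fill; noise fill is also common
    return out
```

A real implementation would replace the linear interpolation with trajectories from a person tracker, as the abstract describes.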