Abstract

Human action recognition methods in videos based on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional data augmentation approach may generate many non-informative samples (video patches covering only a small part of the foreground or only the background) that are not related to a specific action. These samples can be regarded as noisy samples with incorrect labels, which reduces overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, we propose backpropagating salient patches and randomly cropped samples in the same iteration to perform gradient compensation, alleviating the adverse gradient effects of non-informative samples. Salient patches refer to samples containing critical information for human action recognition. The generation of salient patches is formulated as a Markov decision process, and a reinforcement learning agent called SPA (Salient Patch Agent) is introduced to extract patches in a weakly supervised manner without extra labels. Extensive experiments were conducted on two well-known datasets, UCF-101 and HMDB-51, to verify the effectiveness of the proposed SPA and ASNet.

Highlights

  • Video-based human action recognition is one of the key tasks in video understanding

  • We addressed the issue of using random cropping methods for data augmentation in convolutional neural network (CNN)-based video action recognition: noisy samples generated through random cropping will adversely affect the performance of the trained action recognition model

Summary

Introduction

Video-based human action recognition is one of the key tasks in video understanding. It has a wide range of applications [1,2,3,4,5] in intelligent surveillance, health care, human–computer interaction, robot learning, etc. It is found that data augmentation methods based on random cropping often generate non-informative samples (video patches covering only a small part of the foreground or only the background). These samples can be considered noisy samples with incorrect labels: such samples generated through random cropping will adversely affect the performance of the trained action recognition model. We addressed this issue by proposing a Siamese neural network architecture (ASNet) that reduces the negative impact of non-informative samples through gradient compensation. In ASNet, the CNN in the context stream receives input from random-cropping-based data augmentation, while the CNN in the saliency stream receives salient patches from the SPA.
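The gradient-compensation idea above can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the authors' implementation: `ASNetSketch`, `train_step`, the shared `backbone`, and the plain sum of the two losses are all hypothetical names and simplifications; the point is only that the randomly cropped clip and the salient patch pass through the same shared-weight network and contribute gradients in the same iteration.

```python
import torch
import torch.nn as nn

class ASNetSketch(nn.Module):
    """Hypothetical two-stream Siamese sketch: both streams share one CNN."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # shared weights (Siamese)

    def forward(self, context_clip, salient_patch):
        # Context stream: randomly cropped sample; saliency stream: SPA patch.
        return self.backbone(context_clip), self.backbone(salient_patch)

def train_step(model, optimizer, context_clip, salient_patch, labels):
    """One iteration: losses from both streams are backpropagated together,
    so the salient-patch gradient compensates noisy random-crop gradients."""
    criterion = nn.CrossEntropyLoss()
    logits_ctx, logits_sal = model(context_clip, salient_patch)
    loss = criterion(logits_ctx, labels) + criterion(logits_sal, labels)
    optimizer.zero_grad()
    loss.backward()  # gradients from both samples accumulate in one pass
    optimizer.step()
    return loss.item()
```

In a real setting the backbone would be a video CNN and the two loss terms might be weighted; a plain sum is used here only to keep the sketch minimal.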

Deep Learning-Based Action Recognition
Data Augmentation
Saliency Detection for Action Recognition
Deep Reinforcement Learning in Action Recognition
ASNet Framework
Model Formulation
Salient Patch Agent
State and Action Space
Reward
Training of Salient Patch Agent
Datasets
Training of CNN
Training of ASNet
Inference Details
Comparison with Different Cropping Strategies
ASNet with Different Backbones
ASNet with Different Feature Fusion Strategies
Hyperparameters
Analysis of ASNet
Exploration of ASNet Architecture
Visualization of ASNet
Comparison with the State of the Art
Findings
Conclusions
