Abstract

3D Convolutional Neural Networks (CNNs), an important class of deep learning models, perform well at recognizing actions in videos. When recognizing actions from videos, 3D CNNs usually down-sample in the temporal dimension, which loses temporal information. To obtain more temporal information from the videos, this work proposes a new model based on the Inflated 3D ConvNet (I3D), named I3D-T. Instead of down-sampling in the temporal dimension, the proposed model applies dilated convolution in the temporal dimension to enlarge the receptive field. In addition, a non-local feature gating block is designed to learn the correlations between different feature maps. The experimental results show that the proposed I3D-T achieves state-of-the-art performance. Using RGB frames as input, the action recognition accuracies are 95% and 74.8% on the public datasets UCF101 and HMDB51, respectively.

Highlights

  • Action recognition based on videos is the task of automatically recognizing human actions in real-world videos [1].

  • This work changes the pooling stride in the network from 2 × 2 × 2 to 1 × 2 × 2. Because this design makes the receptive field in the temporal dimension smaller, dilated convolution is introduced to enlarge it.

  • Comparison with the state of the art: having analysed the effect of temporal resolution and the non-local feature gating block on action recognition performance, final experiments on all three testing splits of UCF101 and HMDB51 compare the proposed method with other state-of-the-art methods pre-trained on Kinetics (Table 5).
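The effect of changing the pooling stride from 2 × 2 × 2 to 1 × 2 × 2 can be illustrated with a little shape arithmetic. The sketch below is not the authors' code: the helper `pooled_shape`, the 64 × 224 × 224 input clip, and the three pooling stages are all assumptions chosen for illustration, and pooling is simplified to kernel size equal to stride with no padding.

```python
def pooled_shape(shape, stride, stages):
    """Return the (T, H, W) shape after `stages` pooling layers.

    Simplifying assumption: kernel size equals stride and there is no
    padding, so each stage floor-divides every dimension by its stride.
    """
    t, h, w = shape
    st, sh, sw = stride
    for _ in range(stages):
        t, h, w = t // st, h // sh, w // sw
    return (t, h, w)


# Hypothetical input clip: 64 frames of 224 x 224 pixels, three pooling stages.
clip = (64, 224, 224)
print(pooled_shape(clip, (2, 2, 2), stages=3))  # temporal length shrinks to 8
print(pooled_shape(clip, (1, 2, 2), stages=3))  # temporal length stays at 64
```

Under these assumptions the spatial size is reduced identically in both cases, while only the 1 × 2 × 2 setting keeps every input frame represented along the temporal axis.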


Summary

INTRODUCTION

Y. Xu et al.: Action Recognition Using High Temporal Resolution 3D Neural Network Based on Dilated Convolution (Volume 8, 2020)

Action recognition based on videos is the task of automatically recognizing human actions in real-world videos [1]. To improve neural networks' performance in recognizing actions, methods from two aspects are usually considered. Some functional blocks, such as the non-local block [25] and the Temporal Transition Layer (TTL) [26], were designed to improve the performance of action recognition. These methods failed to extract the correlation among different channels of a 3D CNN with respect to temporal and spatial features [27]. This work uses dilated convolution [22] to enlarge the receptive field and keep high temporal resolution while training 3D CNNs.

2) A novel non-local feature gating (NFG) block is proposed, which can learn the relationship between channels in the spatial-temporal dimension and improve the stability of the model.

3) Experimental results show that this work obtains state-of-the-art performance on both UCF101 and HMDB51, whether pre-trained on ImageNet or Kinetics.
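The trade-off between temporal down-sampling and dilation can be checked with the standard receptive-field recurrence for stacked 1D convolutions. This is a minimal sketch, not the I3D-T architecture: the function `temporal_receptive_field`, the three-layer stacks, the kernel size of 3, and the dilation rates 1, 2, 4 are hypothetical values picked only to show the effect.

```python
def temporal_receptive_field(layers):
    """Receptive field and cumulative stride along one (temporal) axis.

    `layers` is a list of (kernel, stride, dilation) tuples, applied in order.
    Each layer widens the receptive field by (kernel - 1) * dilation steps,
    measured in units of the cumulative stride of the layers before it.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf, jump


# Illustrative stacks (assumed values, not taken from the paper):
strided = [(3, 2, 1)] * 3                     # temporal stride 2, no dilation
plain = [(3, 1, 1)] * 3                       # stride 1, no dilation
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]   # stride 1, growing dilation

for name, stack in [("strided", strided), ("plain", plain), ("dilated", dilated)]:
    rf, jump = temporal_receptive_field(stack)
    print(f"{name}: receptive field = {rf} frames, cumulative stride = {jump}")
```

In this toy setting, removing the temporal stride shrinks the receptive field from 15 frames to 7, and the dilated stack recovers 15 frames while keeping a cumulative stride of 1, that is, full temporal resolution.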

