Abstract

This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse the convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial feature extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models: VGG-16, ResNets and DenseNets, and compare its performance with that of the corresponding 3D models. Our empirical results on the action recognition dataset UCF-101 demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the 3D model of the same depth, uses fewer parameters and reduces overfitting. Our semi-CNN architecture achieves an average 16–30% boost in top-1 accuracy when evaluated on input videos of 16 frames.
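
To make the fusion concrete, the sketch below shows one way the spatial, temporal and spatio-temporal blocks can be stacked in PyTorch: early layers of a pre-trained 2D backbone extract per-frame features, a convolution over the frame axis provides the temporal encoding, and 3D convolution layers follow at the top. The class name, the VGG-16 cut point, the channel widths and the layer counts are illustrative assumptions and do not reproduce the exact semi-CNN configuration.

```python
# Illustrative sketch of the 2D -> temporal -> 3D fusion idea; layer counts,
# channel widths and names are assumptions, not the exact semi-CNN layout.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class SemiCNNSketch(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        # Spatial block: early pre-trained VGG-16 layers, applied per frame (transfer learning).
        self.spatial = nn.Sequential(*vgg16(weights="DEFAULT").features[:17])  # -> 256 channels
        # Temporal encoding: convolution over the frame axis only.
        self.temporal = nn.Conv3d(256, 256, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # Spatio-temporal block: 3D convolutions on the fused feature volume.
        self.spatiotemporal = nn.Sequential(
            nn.Conv3d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):                                    # x: (batch, 3, T, H, W)
        b, c, t, h, w = x.shape
        f = self.spatial(x.transpose(1, 2).reshape(b * t, c, h, w))  # per-frame 2D features
        f = f.reshape(b, t, *f.shape[1:]).transpose(1, 2)    # (batch, 256, T, H', W')
        f = self.temporal(f)                                  # temporal encoding
        f = self.spatiotemporal(f).flatten(1)                 # spatio-temporal pooling
        return self.classifier(f)

logits = SemiCNNSketch()(torch.randn(2, 3, 16, 112, 112))     # one 16-frame clip per sample
```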

Highlights

  • Action recognition via monocular video has valuable applications in surveillance, healthcare, sports science and entertainment

  • To retain the same network depth as the base 2D CNN, we reduce the number of layers in the spatial convolution blocks and add layers to the temporal and spatio-temporal blocks

  • As our model uses transfer learning to initialize its network parameters, we report the number of pre-trained parameters for each network (see the sketch after this list)
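
To illustrate how the transferred parameters can be separated from the rest of the fused network, the lines below continue the earlier sketch; the module and attribute names are assumptions carried over from that sketch rather than part of our architecture code.

```python
# Illustrative only: count transferred (pre-trained 2D) parameters vs. the total
# parameter count of the fused network, using the SemiCNNSketch defined earlier.
model = SemiCNNSketch()
pretrained_params = sum(p.numel() for p in model.spatial.parameters())
total_params = sum(p.numel() for p in model.parameters())
print(f"pre-trained: {pretrained_params:,}  total: {total_params:,}")
```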


Introduction

Action recognition via monocular video has valuable applications in surveillance, healthcare, sports science and entertainment. Deep learning methods such as convolutional neural networks (CNNs) [1] have demonstrated superior learning capabilities and the potential to discover underlying features when given a large number of training examples. An action in a video sequence can be characterized by its spatial and temporal features across consecutive frames. Spatial features provide contextual information and the visual appearance of the content, while temporal features capture the motion dynamics that unfold over the video frames. Network performance often degrades on highly varied, realistic and complex videos, owing to major challenges such as occlusion, camera viewpoint changes, background clutter and variations in the subjects and motions involved.
