Abstract

A pre-trained 2D CNN (Convolutional Neural Network) can be used for the spatial stream of the two-stream CNN architecture for videos, treating a representative frame selected from the video as its input. However, the CNN for the temporal stream must be trained from scratch on optical flow frames, which demands expensive computation. In this paper, we propose to adopt a pre-trained 2D CNN for the temporal stream as well, avoiding the optical flow computation. Specifically, three RGB frames selected at three different times in the video sequence are converted into grayscale images and assigned to the R (red), G (green), and B (blue) channels, respectively, to form a Stacked Grayscale 3-channel Image (SG3I). The pre-trained 2D CNN is then fine-tuned on SG3Is for the temporal stream. Therefore, only pre-trained 2D CNNs are used for both the spatial and temporal streams. To learn long-range temporal motion in videos, multiple SG3Is can be used by partitioning the video shot into sub-shots and generating a single SG3I for each sub-shot. Experimental results show that our two-stream CNN with the proposed SG3Is is about 14.6 times faster than the original two-stream CNN with optical flow, while achieving similar recognition accuracy on UCF-101 and a 5.7% better result on HMDB-51.
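To make the SG3I construction concrete, the following is a minimal sketch in Python, assuming NumPy and OpenCV; the frame-sampling rule (first, middle, and last frame of each sub-shot) is a hypothetical choice for illustration, not necessarily the authors' selection scheme:

```python
import numpy as np
import cv2

def make_sg3i(frames):
    """Build a Stacked Grayscale 3-channel Image (SG3I) from a list of
    RGB frames: three frames sampled at three different times are
    converted to grayscale and stacked into the R, G, and B channels,
    respectively."""
    n = len(frames)
    # Hypothetical sampling rule: first, middle, and last frame.
    picks = [frames[0], frames[n // 2], frames[-1]]
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in picks]
    return np.stack(grays, axis=-1)  # H x W x 3, one time step per channel

def make_sg3is(frames, num_subshots):
    """Partition a shot into sub-shots and generate one SG3I per
    sub-shot, capturing longer-range motion as the abstract describes."""
    index_groups = np.array_split(np.arange(len(frames)), num_subshots)
    return [make_sg3i([frames[i] for i in idx]) for idx in index_groups]
```

Because an SG3I has the same shape as an ordinary RGB image, it can be fed to a pre-trained 2D CNN without any architectural change, which is what allows the temporal stream to reuse a pre-trained network.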

Highlights

  • Huge amounts of video data are being generated and stored with a growing number of camera-equipped mobile devices, which demands automatic solutions for various video recognition problems

  • Existing two-stream Convolutional Neural Networks (CNNs) for videos rely on computationally expensive processes such as optical flow computation or high-dimensional convolutions with 3D kernels

  • In this paper we propose a two-stream CNN that adopts only pre-trained 2D CNNs for both the spatial and temporal streams

Introduction

Huge amounts of video data are being generated and stored with the growing number of camera-equipped mobile devices, which demands automatic solutions for various video recognition problems. Compared to image recognition, video recognition is still in its infancy. Given the great success of deep neural networks on still images, they are highly expected to benefit video problems such as action recognition as well. Unlike the many existing 2D Convolutional Neural Networks (CNNs) pre-trained on still images, no 3D CNNs pre-trained on general-purpose video datasets are available. A 3D CNN must instead be trained from scratch on a domain-specific video dataset, which demands a large number of training videos and substantial computing power. An alternative is to exploit 2D CNNs pre-trained on still images for videos. We can adopt a pre-trained 2D CNN for the spatial stream in the two-stream CNN [1], where representative frames selected from the video are used as inputs.
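Since an SG3I is an ordinary 3-channel image, fine-tuning a pre-trained 2D CNN on SG3Is looks the same as standard image fine-tuning. Below is a hedged sketch using PyTorch, with torchvision's ResNet-18 as an assumed stand-in backbone; the paper's choice of network and training hyperparameters are not specified here:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 101  # e.g., UCF-101

# Load an ImageNet-pre-trained 2D CNN and replace its classifier head.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Fine-tune end to end; a reduced learning rate for the pre-trained
# layers is a common (assumed) choice.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# Placeholder batch of SG3Is (real inputs would come from make_sg3is above).
sg3i_batch = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))

loss = criterion(model(sg3i_batch), labels)
loss.backward()
optimizer.step()
```

The same recipe serves both streams: RGB frames for the spatial stream, SG3Is for the temporal stream, with the two networks' scores fused at test time.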
