Abstract

Image data contain spatial information only, making two-dimensional (2D) Convolutional Neural Networks (CNNs) ideal for solving image classification problems. Video data, on the other hand, contain both spatial and temporal information that must be analyzed simultaneously to solve action recognition problems. 3D CNNs are successfully used for these tasks, but they suffer from their extensive inherent parameter set. Increasing the network's depth, as is common among 2D CNNs, and hence the number of trainable parameters does not provide a good trade-off between accuracy and complexity for 3D CNNs. In this work, we propose the Pooling Block (PB) as an enhanced pooling operation for optimizing action recognition by 3D CNNs. A PB comprises three kernels of different sizes. The three kernels simultaneously sub-sample the same feature maps, and their outputs are concatenated into a single output vector. We evaluate our approach with three benchmark 3D CNNs (C3D, I3D, and Asymmetric 3D CNN) on three datasets (HMDB51, UCF101, and Kinetics 400). Our PB method yields a significant improvement in 3D CNN performance with a comparatively small increase in the number of trainable parameters. We further investigate (1) the effect of video frame dimension and (2) the effect of the number of video frames on the performance of 3D CNNs, using C3D as the benchmark.
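The core idea of the Pooling Block can be illustrated with a minimal sketch: one feature map is sub-sampled in parallel by pooling kernels of several sizes, and the flattened results are concatenated into one vector. The function names and the kernel sizes below are illustrative assumptions, not the paper's exact configuration, and a 2D feature map stands in for the 3D case for brevity.

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping 2D max pooling with kernel size and stride k
    (trailing rows/columns that do not fill a window are truncated)."""
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def pooling_block(feature_map, kernel_sizes=(2, 3)):
    """Hypothetical Pooling Block sketch: pool the same feature map with
    several kernel sizes in parallel and concatenate the flattened outputs."""
    outputs = [max_pool2d(feature_map, k).ravel() for k in kernel_sizes]
    return np.concatenate(outputs)

fmap = np.arange(36, dtype=float).reshape(6, 6)
vec = pooling_block(fmap, kernel_sizes=(2, 3))
# 2x2 pooling of a 6x6 map gives 3x3 = 9 values; 3x3 pooling gives 2x2 = 4
print(vec.shape)  # (13,)
```

In a real 3D CNN the pooling windows would extend over the temporal axis as well, and the concatenation would typically happen along the channel dimension rather than as a flat vector.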

Highlights

  • Convolutional Neural Networks (CNNs) have been studied extensively in the last decade and have become the preferred intelligence modeling algorithm for many computer vision tasks

  • Image data contain spatial information only, making two-dimensional (2D) Convolutional Neural Networks (CNN) ideal for solving image classification problems. Video data contain both spatial and temporal information that must be simultaneously analyzed to solve action recognition problems. 3D CNNs are successfully used for these tasks, but they suffer from their extensive inherent parameter set

  • Our objective is to improve the inference accuracy of 3D CNNs without significantly increasing the number of trainable parameters


Summary

Introduction

Convolutional Neural Networks (CNNs) have been studied extensively in the last decade and have become the preferred intelligence modeling algorithm for many computer vision tasks. CNNs can learn relevant higher-order information from structured data, which is believed to be similar to how the human brain learns. Many CNNs include data-normalizing layers to perform what is commonly referred to as batch normalization (BN) [2]. BN significantly reduces the training time of very deep CNNs by reducing internal covariate shift. Single-stream 2D CNNs have demonstrated exceptional performance in solving image classification (recognition) [3]–[7], object segmentation [8], and object detection problems [9], [10]. A single-stream 2D CNN cannot be applied to action recognition in video data, however, because recognizing dynamic actions requires the simultaneous analysis of spatial and temporal information, whereas a 2D CNN can learn only spatial or only temporal information at a time
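The batch normalization operation mentioned above can be sketched in a few lines: each feature is normalized using the mean and variance computed over the batch, then scaled and shifted by learnable parameters. This is a minimal NumPy sketch of the training-time forward pass only (inference uses running statistics, and `gamma`/`beta` would normally be learned per feature).

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization over the batch axis (axis 0):
    normalize each feature to zero mean and unit variance, then scale/shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
out = batch_norm(batch)
print(out.mean(axis=0))  # each feature's batch mean is ~0 after BN
```

Keeping each layer's input distribution stable in this way is what allows very deep networks, including the 3D CNNs discussed here, to train with larger learning rates and fewer iterations.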

