Behavior Recognition Algorithm Based on the Fusion of SE-R3D and LSTM Network

Jin Wu, Yi Yuan An, Wei Dai, Qian Wen Shi

doi:10.1109/access.2021.3119609

Open Access

https://doi.org/10.1109/access.2021.3119609

Abstract

Because existing behavior recognition algorithms cannot fully extract abstract behavior features, this paper proposes SE-R3D-LSTM, a behavior recognition algorithm based on a 3D residual convolutional neural network (R3D) that integrates a squeeze-and-excitation network (SENet) and long short-term memory (LSTM). First, a residual module is added to the 3D convolutional neural network (3D-CNN) to avoid problems such as vanishing gradients caused by deepening the network. Second, the SENet uses both a global average pooling layer and a global maximum pooling layer, which extracts global information more fully and achieves feature recalibration. The SENet is also extended to three dimensions, which tightens the connection between spatiotemporal feature channels. The resulting 3D-SE module is then introduced into the R3D network to enhance effective spatiotemporal features and suppress ineffective ones. Because LSTM can model temporal dependencies in high-level features and learn more effective feature information, an LSTM network is added to the SE-R3D network. Finally, Softmax is used for classification. Experimental results show that the SE-R3D-LSTM network reaches a recognition rate of 96.5% on the UCF101 dataset.
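The 3D squeeze-and-excitation step described in the abstract can be illustrated with a minimal NumPy sketch. It is only an illustration under stated assumptions: the abstract says both global average pooling and global maximum pooling are used, but not how the two are combined, so summing the outputs of a shared two-layer excitation network is an assumption here. The function name `se3d`, the weights `w1` and `w2`, and the reduction ratio `r` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se3d(x, w1, w2):
    """Channel recalibration for one feature map x of shape (C, T, H, W).

    w1: (C // r, C) and w2: (C, C // r) are the excitation weights
    (hypothetical names; biases are omitted for brevity).
    """
    # Squeeze: collapse time, height and width into one value per channel,
    # once with average pooling and once with max pooling.
    avg = x.mean(axis=(1, 2, 3))  # shape (C,)
    mx = x.max(axis=(1, 2, 3))    # shape (C,)

    def excite(z):
        # Bottleneck MLP: reduce to C // r channels, ReLU, expand back to C.
        return w2 @ np.maximum(w1 @ z, 0.0)

    # ASSUMPTION: the two pooled descriptors share the MLP and are summed
    # before the sigmoid; the source does not specify the combination.
    scale = sigmoid(excite(avg) + excite(mx))  # per-channel weights in (0, 1)

    # Scale: reweight every channel of the spatiotemporal feature map.
    return x * scale[:, None, None, None]

rng = np.random.default_rng(0)
C, T, H, W, r = 8, 4, 6, 6, 4
x = rng.standard_normal((C, T, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se3d(x, w1, w2)
print(y.shape)
```

Because each channel is multiplied by a weight between 0 and 1, channels with a large weight pass through nearly unchanged while channels with a small weight are attenuated, which is the "enhance effective features, suppress ineffective ones" behavior the abstract describes.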

Highlights

In the 1970s, Professor Johansson[1] proposed a method for describing the structure of the human body model
To verify the effectiveness of the proposed algorithm, the SE-R3D and SE-R3D-LSTM networks are tested on the HMDB-51 dataset
To address the low recognition rate of traditional algorithms, this paper proposes the SE-R3D-LSTM behavior recognition algorithm

Summary

INTRODUCTION

In the 1970s, Professor Johansson[1] proposed a method for describing the structure of the human body model. The two-stream network only uses stacked video frames as multiple input channels and does not process the video frame sequence in temporal order, so it is difficult for it to extract spatiotemporal motion information. In this regard, in 2015 Donahue et al. proposed the Long-term Recurrent Convolutional Network (LRCN)[17], which uses a CNN to extract static feature information. The convolutional layers of the C3D network model all use 3D-CNN. Experiments show that this network is more suitable for learning spatiotemporal features than 2D-CNN. The difficulties come from the performance of the machine, the size of the dataset, the depth and width of the network, etc. Based on this, He Kaiming et al.[25] proposed the Residual Neural Network (ResNet), whose structure follows VGG-Net but introduces residual modules between layers to avoid the vanishing gradients and network degradation caused by deepening the network[26,27]. The SE-R3D-LSTM network achieved a 96.5% recognition rate on the UCF-101[32] dataset.
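The residual idea mentioned in the introduction can be sketched in a few lines of NumPy. This is a generic illustration of a shortcut connection, not the R3D block from the paper: `residual_block` and `transform` are hypothetical names, and placing the ReLU after the addition follows the original ResNet design rather than anything stated in this summary.

```python
import numpy as np

def residual_block(x, transform):
    """Return ReLU(transform(x) + x).

    `transform` stands in for the stacked convolution layers of a block
    (hypothetical placeholder). The `+ x` term is the shortcut connection.
    """
    return np.maximum(transform(x) + x, 0.0)

rng = np.random.default_rng(0)
# Non-negative input of shape (C, T, H, W), as if it came out of a ReLU.
x = np.abs(rng.standard_normal((4, 2, 3, 3)))

# If the learned transform contributes nothing, the block reduces to the
# identity, so adding more blocks does not have to degrade the features.
out = residual_block(x, lambda v: np.zeros_like(v))
print(np.array_equal(out, x))
```

The shortcut gives the gradient a direct path around the transform, which is why deeper stacks of such blocks remain trainable.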

SE-R3D-LSTM NETWORK STRUCTURE AND ALGORITHM DESIGN

EXPERIMENTS AND ANALYSIS

Findings

CONCLUSION