Human Action Recognition in Videos using Convolution Long Short-Term Memory Network with Spatio-Temporal Networks

Ashok Sarabu, Ajit Kumar Santra

doi:10.28991/esj-2021-01254

Abstract

Two-stream convolutional networks play an essential role as powerful feature extractors in human action recognition in videos. Recent studies have shown the importance of two-stream Convolutional Neural Networks (CNNs) for recognizing human actions. Recurrent Neural Networks (RNNs) combined with CNNs have achieved the best performance in video activity recognition. Encouraged by the results of CNNs combined with RNNs, we present a two-stream network with two CNNs and a Convolution Long-Short Term Memory (CLSTM). First, we extract spatio-temporal features using two CNNs initialized from pre-trained ImageNet models. Second, the outputs of the two CNNs from step one are combined and fed as input to the CLSTM to obtain the overall classification score. We also explore the performance of various fusion functions for combining the two CNNs and the effect of fusing feature maps at different layers, and we identify the best fusion function together with the layer at which to apply it. To avoid overfitting, we adopt data augmentation techniques. Our proposed model demonstrates a substantial improvement over current two-stream methods on the benchmark datasets, with 70.4% on HMDB-51 and 95.4% on UCF-101 using the pre-trained ImageNet model.
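As a rough illustration of what a fusion function does, the sketch below combines a spatial-stream and a temporal-stream feature map of the same shape. The abstract does not name the fusion functions that were compared; sum, max, average, and channel concatenation are common choices in the two-stream literature and are assumed here for illustration only. The function name `fuse` and the `(channels, height, width)` layout are likewise hypothetical.

```python
import numpy as np


def fuse(spatial, temporal, method="sum"):
    """Fuse two feature maps of shape (C, H, W), one per stream.

    The set of methods is an assumption for illustration; it is not
    taken from the paper.
    """
    if spatial.shape != temporal.shape:
        raise ValueError("both streams must produce maps of the same shape")
    if method == "sum":
        return spatial + temporal
    if method == "max":
        return np.maximum(spatial, temporal)
    if method == "average":
        return (spatial + temporal) / 2.0
    if method == "concat":
        # Stack along the channel axis: the result has 2 * C channels.
        return np.concatenate([spatial, temporal], axis=0)
    raise ValueError(f"unknown fusion method: {method}")
```

Sum, max, and average keep the channel count unchanged, while concatenation doubles it, so the layer that consumes the fused map has to be sized accordingly.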

Highlights

Human Action Recognition (HAR) in videos has received tremendous attention in the pattern recognition and computer vision research communities because of its broad spectrum of applications, such as video monitoring, video retrieval, human-computer interaction, and medical applications.
Driven by the rapid growth in the performance of deep Convolutional Neural Network (CNN) models, the computer vision research community started to expand the application of CNNs to human action recognition [1, 2].
The proposed methodology consists of a pre-trained CNN model, data pre-processing for the temporal stream network, and a Convolution Long-Short Term Memory (CLSTM).

Summary

1- Introduction

Human Action Recognition (HAR) in videos has received tremendous attention in the pattern recognition and computer vision research communities because of its broad spectrum of applications, such as video monitoring, video retrieval, human-computer interaction, and medical applications. Karpathy et al. [1] evaluated several CNN models for video activity classification on the SPORTS-1M dataset and reported their results. In a two-stream network, the final prediction is calculated by fusing the results of two CNNs. Many researchers have explored two-stream network architectures and demonstrated good performance. Since the input of the RNN is the output of the CNN, the three-dimensional feature maps are converted to one-dimensional feature vectors [6]. This reduces the number of parameters compared to previous work, but it also diminishes the spatial information. Xingjian et al. [7] extended Long Short Term Memory (LSTM) to three dimensions, proposed the CLSTM, and demonstrated better performance. We further extend this method with a different architecture: we train a two-stream model end-to-end, fuse the outputs, and feed the result as input to the CLSTM. The experimental section discusses implementation details and compares with state-of-the-art results.
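To make the contrast with flattening concrete, here is a minimal NumPy sketch of one convolutional LSTM step in the spirit of the CLSTM attributed to [7]. It is a simplified illustration, not the authors' implementation: the peephole terms are omitted, the weights are random, and all names (`conv2d_same`, `conv_lstm_step`, `run_conv_lstm`) and shapes are assumptions. The point is that the hidden and cell states stay `(channels, height, width)` maps, so the spatial layout survives across time steps.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv2d_same(x, w, b):
    """Cross-correlation with zero padding so the output keeps H x W.

    x: (C_in, H, W), w: (C_out, C_in, k, k) with odd k, b: (C_out,).
    """
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    height, width = x.shape[1:]
    out = np.zeros((w.shape[0], height, width))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]


def conv_lstm_step(x, h, c, w, b):
    """One ConvLSTM step (no peephole connections).

    x: (C_in, H, W); h, c: (C_hid, H, W);
    w: (4 * C_hid, C_in + C_hid, k, k); b: (4 * C_hid,).
    """
    # A single convolution over [x, h] produces all four gate pre-activations.
    gates = conv2d_same(np.concatenate([x, h], axis=0), w, b)
    i, f, o, g = np.split(gates, 4, axis=0)
    c_next = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_next = sigmoid(o) * np.tanh(c_next)
    return h_next, c_next


def run_conv_lstm(frames, c_hid, k=3, seed=0):
    """Run the cell over frames of shape (T, C_in, H, W); return the last h."""
    rng = np.random.default_rng(seed)
    _, c_in, height, width = frames.shape
    w = rng.normal(scale=0.1, size=(4 * c_hid, c_in + c_hid, k, k))
    b = np.zeros(4 * c_hid)
    h = np.zeros((c_hid, height, width))
    c = np.zeros((c_hid, height, width))
    for x in frames:
        h, c = conv_lstm_step(x, h, c, w, b)
    return h
```

For example, `run_conv_lstm(np.ones((5, 3, 6, 6)), c_hid=4)` returns a state of shape `(4, 6, 6)`. A standard LSTM would instead consume each frame's features as a flat vector of length 3 × 6 × 6 = 108, which is where the spatial arrangement is lost.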

2- Related Works

4- Experiments

5- Conclusion

6-1- Data Availability Statement

Findings

7- References

Journal: Emerging Science Journal	Publication Date: Feb 1, 2021
Citations: 26	License type: cc-by

Similar Papers

Analysis of CNN Architectures for Human Action Recognition in Video
David Silva ... Fernando Gaxiola
Computación y Sistemas | VOL. 26
30 Jun 2022

Learning correlations for human action recognition in videos
Yun Yi ... Hanli Wang
Multimedia Tools and Applications | VOL. 76
10 Feb 2017

Multi-stream with Deep Convolutional Neural Networks for Human Action Recognition in Videos
Xiao Liu ... Xudong Yang
01 Jan 2018

Human action recognition in surveillance video of a computer laboratory
Abdul-Lateef Yussiff ... Baharum B Baharudin
01 Aug 2016
