Abstract

Two-stream convolutional networks play an essential role as powerful feature extractors for human action recognition in videos. Recent studies have shown the importance of two-stream Convolutional Neural Networks (CNNs) for recognizing human actions, and Recurrent Neural Networks (RNNs) combined with CNNs have achieved the best performance in video activity recognition. Encouraged by these results, we present a two-stream network composed of two CNNs and a Convolutional Long Short-Term Memory (CLSTM). First, we extract spatio-temporal features with two CNNs initialized from pre-trained ImageNet models. Second, the outputs of the two CNNs are fused and fed as input to the CLSTM to obtain the overall classification score. We also examine the performance of various fusion functions for combining the two CNNs and the effect of fusing feature maps at different layers, and we identify the best fusion function together with the best layer. To avoid overfitting, we adopt data augmentation techniques. Our proposed model demonstrates a substantial improvement over current two-stream methods on the benchmark datasets, reaching 70.4% on HMDB-51 and 95.4% on UCF-101 using pre-trained ImageNet models. Doi: 10.28991/esj-2021-01254
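The fusion step described above can be sketched with a few candidate fusion functions. This is a minimal illustration, not the paper's exact implementation: the feature-map shapes, the random inputs, and the function names (`sum_fusion`, `max_fusion`, `concat_fusion`) are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical feature maps from the spatial and temporal streams,
# laid out as (channels, height, width); shapes are illustrative only.
spatial = np.random.rand(4, 7, 7)
temporal = np.random.rand(4, 7, 7)

def sum_fusion(a, b):
    """Element-wise sum of the two streams' feature maps."""
    return a + b

def max_fusion(a, b):
    """Element-wise maximum across the two streams."""
    return np.maximum(a, b)

def concat_fusion(a, b):
    """Stack the two streams along the channel axis,
    doubling the channel count."""
    return np.concatenate([a, b], axis=0)

fused = concat_fusion(spatial, temporal)
print(fused.shape)  # (8, 7, 7)
```

Sum and max fusion keep the channel count unchanged, while concatenation doubles it and leaves it to a subsequent convolution to learn how to weight the two streams.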

Highlights

  • Human Action Recognition (HAR) in videos has received tremendous attention in the pattern recognition and computer vision research community because of its broad spectrum of applications, such as video monitoring, video retrieval, human-computer interaction, and medical applications

  • Driven by the rapid growth in the performance of deep Convolutional Neural Networks (CNN) models, the computer vision academic and research community started to expand the application of CNNs to human action recognition [1, 2]

  • The proposed methodology consists of a pre-trained CNN model, data pre-processing for the temporal stream network, and a Convolutional Long Short-Term Memory (CLSTM)
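The abstract notes that data augmentation is used to avoid overfitting. A minimal sketch of two common video-frame augmentations (random cropping and horizontal flipping) is shown below; the crop size, frame shape, and function names are assumptions for illustration, not the authors' exact pre-processing pipeline.

```python
import numpy as np

def random_crop(frame, size, rng=None):
    """Crop a random (size x size) patch from an (H, W, C) frame."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = frame.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return frame[top:top + size, left:left + size]

def horizontal_flip(frame, rng=None, p=0.5):
    """Flip the frame left-right with probability p."""
    if rng is None:
        rng = np.random.default_rng()
    return frame[:, ::-1] if rng.random() < p else frame

frame = np.zeros((240, 320, 3))          # a dummy video frame
patch = horizontal_flip(random_crop(frame, 224))
print(patch.shape)  # (224, 224, 3)
```

Applying the same random crop and flip to every frame of a clip (rather than re-sampling per frame) keeps the augmented clip temporally consistent.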


Summary

1- Introduction

Human Action Recognition (HAR) in videos has received tremendous attention in the pattern recognition and computer vision research community because of its broad spectrum of applications, such as video monitoring, video retrieval, human-computer interaction, and medical applications. Karpathy et al. [1] proposed several solutions for video activity classification on the SPORTS-1M dataset and reported the results of several CNN models. The final prediction is calculated by fusing the results of two CNNs. Many researchers have since explored two-stream network architectures and demonstrated good performance. Since the input of the RNN is the output of the CNN, the three-dimensional feature maps are converted into one-dimensional feature vectors [6]. This conversion decreases the number of parameters compared to previous work, but it also discards spatial information. Xingjian et al. [7] extended the Long Short-Term Memory (LSTM) to three-dimensional inputs and proposed the CLSTM, which they showed to perform better. We further extend this method with a different architecture: we train a two-stream model end-to-end, fuse the outputs of the two streams, and feed the result as input to a CLSTM. The experimental section discusses implementation details and compares our results with the state of the art.
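The key idea behind the CLSTM of Xingjian et al. [7], as described above, is that the matrix products inside a standard LSTM are replaced by convolutions, so the hidden state and cell state retain their spatial layout. The sketch below illustrates a single-channel, bias-free ConvLSTM step in NumPy; the kernel sizes, shapes, and the naive (cross-correlation) convolution are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2D cross-correlation
    (single channel, odd-sized kernel)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, W):
    """One ConvLSTM step: each gate applies convolutions to the
    input x and hidden state h, so h and c stay two-dimensional.
    W maps gate name -> (input kernel, hidden kernel); biases omitted."""
    i = sigmoid(conv2d_same(x, W['i'][0]) + conv2d_same(h, W['i'][1]))
    f = sigmoid(conv2d_same(x, W['f'][0]) + conv2d_same(h, W['f'][1]))
    o = sigmoid(conv2d_same(x, W['o'][0]) + conv2d_same(h, W['o'][1]))
    g = np.tanh(conv2d_same(x, W['g'][0]) + conv2d_same(h, W['g'][1]))
    c_next = f * c + i * g          # same cell update as a standard LSTM
    h_next = o * np.tanh(c_next)
    return h_next, c_next

rng = np.random.default_rng(0)
W = {k: (rng.normal(size=(3, 3)), rng.normal(size=(3, 3)))
     for k in ('i', 'f', 'o', 'g')}
h = c = np.zeros((7, 7))
for t in range(5):                  # a short sequence of fused feature maps
    x = rng.normal(size=(7, 7))
    h, c = convlstm_step(x, h, c, W)
print(h.shape)  # (7, 7)
```

Because h and c are spatial maps rather than flat vectors, the spatial information that a conventional CNN-to-RNN handoff flattens away is preserved across time steps.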

2- Related Works
4- Experiments
5- Conclusion
6-1- Data Availability Statement
7- References

