Abstract

The two-stream convolutional neural network (CNN) has proven highly successful for action recognition in videos. The main idea is to train two CNNs to learn spatial and temporal features separately, and to combine their scores to obtain the final prediction. In the literature, we observed that most methods use similar CNNs for the two streams. In this paper, we design a two-stream CNN architecture that uses different CNNs for the two streams to learn spatial and temporal features. Temporal Segment Networks (TSN) are applied to retrieve long-range temporal features and to differentiate similar sub-actions in videos. Data augmentation techniques are employed to prevent over-fitting. Advanced cross-modal pre-training is discussed and introduced into the proposed architecture to enhance action recognition accuracy. The proposed two-stream model is evaluated on two challenging action recognition datasets: HMDB-51 and UCF-101. The results show a significant performance increase, and the proposed architecture outperforms existing methods.
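
As a rough illustration (not the authors' code), the following PyTorch sketch shows the late-fusion idea described above: a spatial stream sees a sampled RGB frame, a temporal stream sees stacked optical flow, and their softmax scores are combined. The function names, the fusion weights, and the use of ResNet-50 for both streams are illustrative assumptions; the paper pairs different backbones (e.g., ResNet-50 and Inception-V2) for the two streams.

# Minimal two-stream late-fusion sketch (illustrative, not the paper's code).
# Assumptions: ResNet-50 for both streams for brevity; optical flow is
# pre-computed and stacked as a 2*L-channel tensor (L = 10 flow frames here).

import torch
import torch.nn as nn
from torchvision import models

FLOW_CHANNELS = 2 * 10  # x/y displacement maps for 10 consecutive flow frames

def make_spatial_stream(num_classes: int) -> nn.Module:
    """Spatial stream: a standard RGB ResNet-50."""
    net = models.resnet50(weights=None)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

def make_temporal_stream(num_classes: int) -> nn.Module:
    """Temporal stream: ResNet-50 with its first conv widened to accept
    stacked optical flow instead of 3-channel RGB."""
    net = models.resnet50(weights=None)
    net.conv1 = nn.Conv2d(FLOW_CHANNELS, 64, kernel_size=7,
                          stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

@torch.no_grad()
def fused_prediction(spatial, temporal, rgb, flow,
                     w_spatial=1.0, w_temporal=1.5):
    """Late fusion: weighted average of the two streams' class scores."""
    s = torch.softmax(spatial(rgb), dim=1)
    t = torch.softmax(temporal(flow), dim=1)
    return (w_spatial * s + w_temporal * t) / (w_spatial + w_temporal)

if __name__ == "__main__":
    num_classes = 101  # e.g., UCF-101
    spatial = make_spatial_stream(num_classes).eval()
    temporal = make_temporal_stream(num_classes).eval()
    rgb = torch.randn(1, 3, 224, 224)               # one sampled RGB frame
    flow = torch.randn(1, FLOW_CHANNELS, 224, 224)  # stacked optical flow
    print(fused_prediction(spatial, temporal, rgb, flow).argmax(dim=1))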

Highlights

  • Human Action Recognition is an emerging research area that has gained prominent attention in computer vision

  • We propose a two-stream convolutional neural network (CNN) model, built on the two-stream network design, for identifying actions in videos

  • We evaluate the experiments with Residual Network (ResNet)-50 and Inception-V2 models to verify the efficiency of the advanced cross-modal pre-training technique


Summary

Introduction

Human Action Recognition is an emerging research area that has gained prominent attention in computer vision. Earlier methods are able to utilize the temporal component, but only over short durations; in lengthy videos, information cannot persist for a long time. To solve this problem, Wang et al. [6] designed a video-level segmental architecture, called Temporal Segment Networks, that efficiently learns features and retrieves long-range time-varying features from videos. Other methods proposed in [5,6,7,8,9,10,11] utilize similar network models for the two streams for human action recognition in videos. Inspired by the two-stream processing of the human visual cortex, we propose a two-stream CNN architecture for action recognition in videos. A segment-based temporal modeling technique is used to better capture long-range temporal information.
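
For illustration, the sketch below (assumed names, not the TSN reference implementation) shows the two ingredients of segment-based temporal modeling: dividing the video into equal segments, sampling one snippet from each, and averaging the per-snippet class scores via a consensus function.

# TSN-style sparse sampling and average consensus (illustrative sketch).
# `SegmentConsensus` and `sample_snippet_indices` are hypothetical names.

import random
import torch
import torch.nn as nn

def sample_snippet_indices(num_frames: int, num_segments: int = 3):
    """Divide the video into equal segments and draw one random frame
    index from each segment (the TSN sparse-sampling scheme)."""
    seg_len = num_frames // num_segments
    return [random.randrange(k * seg_len, (k + 1) * seg_len)
            for k in range(num_segments)]

class SegmentConsensus(nn.Module):
    """Apply a shared snippet network to each sampled snippet and average
    the resulting class scores (the segmental consensus function)."""
    def __init__(self, snippet_net: nn.Module):
        super().__init__()
        self.snippet_net = snippet_net

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, num_segments, C, H, W)
        b, s = snippets.shape[:2]
        scores = self.snippet_net(snippets.flatten(0, 1))  # (b*s, classes)
        return scores.view(b, s, -1).mean(dim=1)           # average consensus

if __name__ == "__main__":
    print(sample_snippet_indices(num_frames=300, num_segments=3))
    toy_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 51))  # toy stand-in CNN
    model = SegmentConsensus(toy_net)
    clips = torch.randn(2, 3, 3, 8, 8)  # (batch=2, segments=3, C=3, H=8, W=8)
    print(model(clips).shape)           # -> torch.Size([2, 51])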

Related Works
Space-Time Networks
Hybrid Networks
Two-Stream Networks
Technical Approach
Distinct Two-Stream Convolution Networks
Base Networks
Residual Network
Inception-V2
Segment-Based Temporal Modeling
Data Augmentation
Advanced Cross-Modal Pre-Training
Experiments
Datasets and Implementation Details
Testing
Exploration Study
Comparison with State-of-the-Art
Findings
Conclusions
