Abstract

Two-stream Convolutional Neural Networks have shown excellent performance in video action recognition. Most existing works train each sampling group independently or fuse them only at the last level, which ignores the temporal continuity of actions and the complementary information between action fragments. In this paper, a temporal segment connection network is proposed to overcome these limitations. On the one hand, the forget gate module of the long short-term memory (LSTM) network is used to establish feature-level connections between the sampling groups. This not only strengthens information transmission between the sampling groups to enhance temporal connectivity, but also extracts the complementary information between them to enhance the overall representation of the action. On the other hand, a bi-directional long short-term memory (Bi-LSTM) network is used to automatically evaluate the importance weight of each sampling group based on the deep feature sequence. Experimental results on the UCF101 and HMDB51 datasets show that the proposed model can effectively improve the utilization of temporal information and the overall representation of actions, and thus significantly improves the accuracy of human action recognition.
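For concreteness, here is a minimal PyTorch sketch of the two modules described above, assuming each sampling group has already been encoded into a fixed-size feature vector. All module and parameter names (ForgetGateConnection, AdaptiveWeighting, feat_dim, hidden) are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ForgetGateConnection(nn.Module):
    """Sketch of a feature-level connection between sampling groups:
    an LSTM-style forget gate controls how much of the previous
    group's features flows into the current group."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_segments, feat_dim), one row per sampling group
        fused = [feats[:, 0]]
        for t in range(1, feats.size(1)):
            f = torch.sigmoid(self.gate(torch.cat([fused[-1], feats[:, t]], dim=-1)))
            fused.append(f * fused[-1] + feats[:, t])  # retained history + current group
        return torch.stack(fused, dim=1)

class AdaptiveWeighting(nn.Module):
    """Sketch of Bi-LSTM-based weighting: score each group's feature
    in temporal context, then softmax the scores into weights."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        ctx, _ = self.bilstm(feats)                # (batch, num_segments, 2*hidden)
        w = torch.softmax(self.score(ctx), dim=1)  # per-group importance weights
        return (w * feats).sum(dim=1)              # weighted video-level feature
```

Chaining the two modules yields one weighted video-level feature from the per-group features, matching the pipeline the abstract describes.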

Highlights

  • Video-based action recognition attracts extensive attention due to its applications in many fields like security and behavior analysis

  • The original two-stream convolutional neural network can combine spatial and temporal information, but it focuses only on short-term motion changes and does not capture long-term information about the video (a late-fusion sketch follows this list)

  • To verify the effect of the temporal segment connection network (TSCN) on action recognition, the baseline Temporal Segment Network (TSN) is used for comparison on the HMDB51 and UCF101 datasets
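To make the two-stream idea in the highlights concrete, below is a hedged sketch of score-level (late) fusion over an RGB stream and a stacked-optical-flow stream. The backbone choice (resnet18), the flow_stack depth, and all names are assumptions for illustration only:

```python
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamNet(nn.Module):
    """Illustrative two-stream baseline: separate spatial (RGB) and
    temporal (stacked optical flow) CNNs, fused at the score level."""
    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        self.spatial = resnet18(num_classes=num_classes)   # single RGB frame
        self.temporal = resnet18(num_classes=num_classes)  # stacked flow fields
        # Flow input has 2*flow_stack channels (x/y displacement per frame pair).
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64,
                                        kernel_size=7, stride=2, padding=3, bias=False)

    def forward(self, rgb, flow):
        # Late fusion: average the two streams' class scores.
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))
```

Because each stream sees only one frame (or a short flow stack), this baseline captures short-term motion but not long-range temporal structure, which is the gap TSN and TSCN address.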


Summary

INTRODUCTION

Video-based action recognition attracts extensive attention due to its applications in many fields such as security and behavior analysis. Video action recognition involves both spatial appearance information and temporal motion information. The original two-stream convolutional neural network can combine spatial and temporal information, but it focuses only on short-term motion changes and does not capture long-term information about the video. To address this issue, Wang and Xiong [15] proposed the Temporal Segment Network (TSN), which extracts several sampling groups from a video to enhance the long-term modeling ability of the network. The complementary information between sampling groups depends on the heterogeneity between them: the more heterogeneous the sampling groups are, the more complementary information they contain.
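As a rough illustration of TSN-style sparse sampling (the function name and defaults are assumptions, not taken from the paper): the video is split into equal spans and one frame is drawn from each, so the sampling groups together cover the whole action.

```python
import numpy as np

def sample_segments(num_frames: int, num_segments: int = 3, rng=None):
    """Split a video into num_segments equal spans and draw one frame
    index from each span (assumes num_frames >= num_segments)."""
    rng = rng or np.random.default_rng()
    bounds = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]

# e.g. sample_segments(90) might return [7, 41, 62]: one frame per third.
```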

RELATED WORKS
FORGET-GATE CONNECTION MODULE
ADAPTIVE WEIGHTING MODULE
EXPERIMENT AND ANALYSIS
Findings
CONCLUSION

