Abstract

Action recognition is a hot research direction of computer vision. How to deal with human action in untrimmed video in real time is a very significant challenge. It can be widely used in fields such as real-time monitoring. In this paper, we propose an end-to-end Deep Spatial-Temporal Network (DstNet) for action recognition and localisation. First of all, the untrimmed video is clipped into segments with fixed length. Then, the Convolutional 3 Dimension (C3D) network is used to extract highly dimensional features for each segment. Finally, the extracted feature sequences of several continual segments are input into Long Short-Term Memory (LSTM) network to find the intrinsic relationship among clipped segments to take action recognition and localisation simultaneously in the untrimmed video. While maintaining good accuracy, our network has the function of real-time video processing and has achieved good results in the standard evaluation performance of THUMOS14.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.