Abstract

Human action recognition has attracted extensive research effort in recent years; within this field, traffic police gesture recognition is important for self-driving vehicles. A crucial challenge in this task is finding a representation method based on spatial-temporal features. However, existing methods fuse spatial and temporal information poorly, and feature extraction for traffic police gestures has not been well studied. This paper proposes an attention mechanism based on an improved spatial-temporal convolutional neural network (AMSTCNN) for traffic police gesture recognition. The method focuses on the body parts performing the action and exploits the correlation between spatial and temporal features to recognize traffic police gestures, ensuring that gesture information is not lost. Specifically, AMSTCNN integrates spatial and temporal information, uses weight matching to attend to the image regions where human action occurs, and extracts region proposals from the image. Finally, Softmax classifies the actions after spatial-temporal feature fusion. AMSTCNN makes strong use of the spatial-temporal information in videos and selects effective features to reduce computation. Experiments on the AVA and Chinese traffic police gesture datasets show that our method outperforms several state-of-the-art methods.
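
As a rough illustration of the fusion-and-attention idea the abstract describes, the PyTorch sketch below fuses a 2D spatial branch with a 3D temporal branch, applies a learned spatial attention map that up-weights regions where action occurs, and classifies the attention-pooled features with Softmax. Every module name, layer choice, dimension, and the exact attention form here are assumptions made for illustration; the abstract does not specify the actual AMSTCNN architecture.

import torch
import torch.nn as nn

class SpatialTemporalAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse spatial and temporal features, then
    attend to action regions before classification."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        # Spatial branch: 2D convolution on a single frame (assumed).
        self.spatial = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        # Temporal branch: 3D convolution over the frame stack (assumed).
        self.temporal = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        # One attention weight per spatial location of the fused map,
        # so the model focuses on where the gesture happens.
        self.attn = nn.Conv2d(2 * channels, 1, kernel_size=1)
        self.classifier = nn.Linear(2 * channels, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, frames, height, width)
        b, _, t, h, w = clip.shape
        # Spatial features from the centre frame; temporal features
        # averaged over time from the 3D branch.
        spat = self.spatial(clip[:, :, t // 2])      # (b, C, h, w)
        temp = self.temporal(clip).mean(dim=2)       # (b, C, h, w)
        fused = torch.cat([spat, temp], dim=1)       # (b, 2C, h, w)
        # Softmax over locations gives a normalized attention map.
        weights = torch.softmax(self.attn(fused).flatten(2), dim=-1)
        weights = weights.view(b, 1, h, w)
        pooled = (fused * weights).sum(dim=(2, 3))   # (b, 2C)
        return self.classifier(pooled)               # class logits

# Usage: a batch of two 8-frame RGB clips at 112x112 resolution.
model = SpatialTemporalAttentionFusion(channels=64, num_classes=10)
logits = model(torch.randn(2, 3, 8, 112, 112))
print(logits.shape)  # torch.Size([2, 10])

The softmax-normalized attention map plays the role the abstract assigns to weight matching: locations with higher scores contribute more to the pooled feature, which also keeps the classifier's input compact and reduces computation downstream.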
