Abstract

An essential prerequisite for autonomous vehicles deployed in urban scenarios is the ability to accurately recognize the behavioral intentions of pedestrians and other vulnerable road users and to take measures to ensure their safety. In this paper, a spatial-temporal feature fusion-based multi-attention network (STFF-MANet) is designed to predict pedestrian crossing intention. Pedestrian information, vehicle information, scene context, and optical flow are extracted from continuous image sequences as feature sources. A lightweight 3D convolutional network is designed to extract temporal features from the optical flow, and a spatial encoding module is constructed to extract spatial features from the scene context. Pedestrian motion information is re-encoded using a stack of gated recurrent units. The final network structure is determined through ablation studies, which introduce attention mechanisms into the network to merge the pedestrian motion features with the spatio-temporal features. The effectiveness of the proposed method is demonstrated by comparison experiments on the JAAD and PIE datasets. On the JAAD dataset, intention recognition accuracy is 9% higher than that of existing techniques.
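The attention-based fusion step described above can be illustrated with a minimal sketch: feature vectors from each source (GRU-encoded pedestrian motion, 3D-convolutional optical-flow features, and spatial context encodings) are scored against a learned query, normalized with a softmax, and combined as a weighted sum. This is an assumed, simplified reading of the fusion mechanism, not the authors' implementation; all names and shapes here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(features, w_query):
    """Fuse feature vectors from several sources by attention weighting.

    features: list of (d,)-shaped vectors, one per feature source.
    w_query:  (d,)-shaped learned query vector (hypothetical parameter).
    Returns the fused (d,) vector and the per-source attention weights.
    """
    F = np.stack(features)        # (n_sources, d)
    scores = F @ w_query          # one scalar score per source
    weights = softmax(scores)     # normalized attention over sources
    return weights @ F, weights   # weighted sum of source features

# Toy stand-ins for the three feature streams named in the abstract.
rng = np.random.default_rng(0)
d = 8
motion_feat = rng.standard_normal(d)    # GRU-encoded pedestrian motion
temporal_feat = rng.standard_normal(d)  # 3D-conv optical-flow features
spatial_feat = rng.standard_normal(d)   # spatial context encoding
w_query = rng.standard_normal(d)

fused, weights = attention_fuse(
    [motion_feat, temporal_feat, spatial_feat], w_query
)
```

In a trained network the query (and typically per-source projections) would be learned end to end; this sketch only shows the score-normalize-sum pattern common to attention-based feature fusion.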
