Abstract

Pedestrian trajectory prediction in crowd scenes plays a significant role in intelligent transportation systems. The main challenges lie in learning motion patterns and addressing future uncertainty. Trajectory prediction is typically decomposed along two dimensions: modeling temporal dynamics and capturing social interactions. For temporal dependencies, existing models based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs) achieve high performance on short-term prediction but still suffer from limited scalability on long sequences. For social interactions, previous graph-based methods consider only fixed features and ignore the dynamic interactions between pedestrians. Since the transformer network has a strong capability of capturing spatial and long-term temporal dynamics, we propose the Long-Short Term Spatio-Temporal Aggregation (LSSTA) network for human trajectory prediction. First, a spatial encoder, built on a modern variant of graph neural networks, is presented to characterize spatial interactions between pedestrians. Second, LSSTA employs a transformer network to handle long-term temporal dependencies and aggregates the spatial and temporal features with a temporal convolution network (TCN); the TCN is thus combined with the transformer to form a long-short term temporal dependency encoder. Additionally, multi-modal prediction is an effective way to address future uncertainty, so existing auto-encoder modules are extended with static scene information and the future ground truth for multi-modal trajectory prediction. Experimental results on complex scenes demonstrate the superior performance of our method in comparison to existing approaches.
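To make the long-short term temporal dependency encoder described above more concrete, the sketch below pairs a transformer encoder (long-range dependencies) with a causal temporal convolution (short-term aggregation) in PyTorch. This is a minimal illustration based only on this abstract, not the paper's implementation; the class name `LongShortTermTemporalEncoder`, all hyperparameters, and the assumption that the inputs are per-pedestrian embeddings produced by the spatial encoder are our own.

```python
# Hypothetical sketch of a long-short term temporal dependency encoder:
# a transformer encoder captures long-range temporal dependencies and a
# causal 1-D convolution (TCN-style) aggregates short-term local context.
# Names and hyperparameters are illustrative, not from the paper's code.
import torch
import torch.nn as nn

class LongShortTermTemporalEncoder(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4, n_layers=2, kernel_size=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads,
            dim_feedforward=2 * feat_dim, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Left-only padding keeps the convolution causal: each time step
        # sees past frames but not future ones.
        self.pad = nn.ConstantPad1d((kernel_size - 1, 0), 0.0)
        self.tcn = nn.Conv1d(feat_dim, feat_dim, kernel_size)
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (batch, obs_len, feat_dim) per-pedestrian embeddings,
        # e.g. the output of a spatial interaction encoder.
        h = self.transformer(x)                     # long-term dependencies
        h = self.pad(h.transpose(1, 2))             # (batch, feat_dim, obs_len)
        h = self.act(self.tcn(h)).transpose(1, 2)   # short-term aggregation
        return h                                    # (batch, obs_len, feat_dim)

if __name__ == "__main__":
    enc = LongShortTermTemporalEncoder()
    obs = torch.randn(8, 8, 64)   # 8 pedestrians, 8 observed frames
    print(enc(obs).shape)         # torch.Size([8, 8, 64])
```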
