Abstract

Multi-source visual information from ring cameras and stereo cameras provides direct observation of the road, traffic conditions, and vehicle behavior. However, relying solely on visual information may not provide a complete understanding of the environment. It is therefore crucial for intelligent transportation systems to effectively utilize multi-source, multi-modal data to accurately predict the future motion trajectories of vehicles. This paper presents a new model for predicting multi-modal trajectories by integrating multi-source visual features. A spatial–temporal cross-attention fusion module is developed to capture the spatiotemporal interactions among vehicles while leveraging the road's geometric structure to improve prediction accuracy. Experimental results on the real-world Argoverse 2 dataset demonstrate that, compared with other methods, our approach improves the metrics of minADE (Minimum Average Displacement Error), minFDE (Minimum Final Displacement Error), and MR (Miss Rate) by 1.08%, 3.15%, and 2.14%, respectively, in unimodal prediction. In multimodal prediction, the improvements are 5.47%, 4.46%, and 6.50%, respectively. Our method effectively captures the temporal and spatial characteristics of vehicle movement trajectories, making it suitable for autonomous driving applications.
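The abstract describes a spatial–temporal cross-attention fusion module but does not specify its architecture. The sketch below is only an illustrative interpretation, assuming a transformer-style block in which temporal self-attention encodes each agent's history and cross-attention fuses it with spatial context tokens (neighboring agents and lane segments); all layer sizes, tensor shapes, and the fusion order are assumptions, not the paper's implementation.

```python
# Minimal sketch of a spatial-temporal cross-attention fusion block (illustrative only).
import torch
import torch.nn as nn


class SpatialTemporalCrossAttention(nn.Module):
    """Fuses an agent's temporal trajectory features with spatial context
    (neighboring agents and lane/map polylines) via multi-head cross-attention."""

    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        # Temporal self-attention over the agent's own history.
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention: temporal features query the spatial context tokens.
        self.spatial_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, agent_hist: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # agent_hist: (B, T, d_model) encoded past trajectory of the target agent
        # context:    (B, N, d_model) encoded neighbor-agent and lane tokens
        h, _ = self.temporal_attn(agent_hist, agent_hist, agent_hist)
        h = self.norm1(agent_hist + h)                     # residual + norm
        c, _ = self.spatial_cross_attn(h, context, context)
        h = self.norm2(h + c)                              # spatial fusion
        return self.norm3(h + self.ffn(h))                 # position-wise FFN


if __name__ == "__main__":
    block = SpatialTemporalCrossAttention()
    hist = torch.randn(4, 50, 128)   # 4 scenes, 50 past timesteps
    ctx = torch.randn(4, 96, 128)    # 96 context tokens per scene
    print(block(hist, ctx).shape)    # torch.Size([4, 50, 128])
```

In a multimodal setup, the fused features would typically be decoded into K candidate trajectories, and minADE/minFDE/MR would be evaluated over the best of the K modes, as in the Argoverse 2 benchmark.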
