Abstract

Multi-source visual information from ring cameras and stereo cameras provides direct observation of the road, traffic conditions, and vehicle behavior. However, relying solely on visual information may not yield a complete understanding of the environment. It is therefore crucial for intelligent transportation systems to effectively exploit multi-source, multi-modal data to accurately predict the future motion trajectories of vehicles. This paper presents a new model for multi-modal trajectory prediction that integrates multi-source visual features. A spatial–temporal cross attention fusion module is developed to capture the spatiotemporal interactions among vehicles while leveraging the road’s geographic structure to improve prediction accuracy. Experimental results on the real-world Argoverse 2 dataset demonstrate that, compared with other methods, ours improves minADE (Minimum Average Displacement Error), minFDE (Minimum Final Displacement Error), and MR (Miss Rate) by 1.08%, 3.15%, and 2.14%, respectively, in unimodal prediction; in multimodal prediction, the improvements are 5.47%, 4.46%, and 6.50%. Our method effectively captures the temporal and spatial characteristics of vehicle movement trajectories, making it suitable for autonomous driving applications.
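
The abstract does not give implementation details of the spatial–temporal cross attention fusion module. The PyTorch sketch below illustrates one plausible reading under stated assumptions: temporal self-attention is applied over each agent's trajectory history, and cross attention then lets those temporal features attend to spatial context tokens (neighboring agents and map elements). All class names, shapes, and hyperparameters here are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn


class SpatialTemporalCrossAttentionFusion(nn.Module):
    """Illustrative sketch (not the paper's implementation): fuse per-agent
    temporal features with spatial context via cross attention."""

    def __init__(self, dim: int = 128, num_heads: int = 8):
        super().__init__()
        # Self-attention over each agent's trajectory history (temporal axis)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention: temporal queries attend to spatial context tokens
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, agent_hist: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # agent_hist: (batch, T, dim)  temporal features of the target agent
        # context:    (batch, N, dim)  spatial tokens (neighbors, lane segments)
        t, _ = self.temporal_attn(agent_hist, agent_hist, agent_hist)
        h = self.norm1(agent_hist + t)
        # Cross attention: queries from the temporal stream, keys/values from space
        s, _ = self.spatial_attn(h, context, context)
        h = self.norm2(h + s)
        return self.norm3(h + self.ffn(h))


# Hypothetical usage with made-up shapes
fusion = SpatialTemporalCrossAttentionFusion()
hist = torch.randn(4, 50, 128)   # 4 agents, 50 past timesteps, 128-d features
ctx = torch.randn(4, 64, 128)    # 64 spatial context tokens per agent
out = fusion(hist, ctx)          # (4, 50, 128) fused spatiotemporal features
```

The residual-plus-normalization layout is a standard transformer convention; the paper may order or combine the temporal and spatial attention differently.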
