MINet: A Pedestrian Trajectory Forecasting Method with Multi-Information Feature Fusion
Pedestrian trajectory prediction plays an exceptionally vital role in autonomous driving, enabling advanced analysis and decision-making in certain scenarios to ensure driving safety. Predicting pedestrian trajectories is a highly complex task, encompassing static scenes, dynamic scenes, and subjective intent. To enhance the accuracy of pedestrian trajectory prediction, it is crucial to model these scenarios, extract relevant features, and fuse them effectively. However, existing methods only consider some of the scenarios mentioned above and extract static scene features through manual annotation of road key points, which fails to meet the demands of autonomous driving in complex traffic scenarios. To overcome these limitations, this paper introduces MINet -- a network that employs multi-information feature fusion. Unlike previous approaches, MINet adopts a more automated approach to extract static scenes, including sidewalks and lawns. Moreover, the network incorporates pedestrian destination modeling to improve prediction accuracy. Furthermore, to tackle the challenge of collision avoidance in crowded spaces, this paper incorporates the extraction of dynamic scene changes through relative velocity modeling of objects. The proposed network achieved an improvement of 47.7 % in the ADE metric and 62.6 % in the FDE metric on the ETH/UCY dataset. In the SDD dataset, there was an improvement of 18.4 % in the ADE metric and 35.2 % in the FDE metric.
- Research Article
- 10.1088/1742-6596/2917/1/012036
- Dec 1, 2024
- Journal of Physics: Conference Series
Predicting pedestrian trajectories is an essential task in computer vision and artificial intelligence, aiming to forecast the future motion paths of pedestrians from historical data. Current research primarily focuses on extracting the factors that influence pedestrian movement and modeling these factors independently to predict future trajectories. However, existing methods overlook the varying degrees of influence that different factors may have on pedestrian movement at different time steps. To address this issue, after extracting the influencing factors, we employ a model-adaptive approach to model the importance of each factor at each time step and then predict pedestrian trajectories. Specifically, we utilize a simple multi-factor feature extraction method to extract the influencing factors. Subsequently, a Transformer-based dynamic weight fusion mechanism dynamically adjusts the weight of each factor on pedestrian movement at each time step, thereby achieving pedestrian trajectory prediction. Our method performs particularly well in dense crowds, with experimental results showing improvements of 6% and 31% over state-of-the-art approaches.
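The idea of weighting influencing factors differently at each time step can be illustrated with a minimal sketch. This is not the paper's Transformer-based mechanism: it assumes the per-step importance scores are already given and simply turns them into softmax weights. The names `softmax`, `fuse_step`, `social`, `scene`, and `scores_per_step` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_step(factor_features, factor_scores):
    """Weighted sum of equal-length factor vectors for one time step."""
    weights = softmax(factor_scores)
    dim = len(factor_features[0])
    return [
        sum(w * f[k] for w, f in zip(weights, factor_features))
        for k in range(dim)
    ]


# Hypothetical feature vectors for two influencing factors.
social = [1.0, 2.0]
scene = [3.0, 4.0]

# Assumed importance scores, one pair per time step. In the paper these
# would come from a learned model; here they are fixed by hand.
scores_per_step = [[0.0, 0.0], [2.0, 0.0]]

fused_per_step = [fuse_step([social, scene], s) for s in scores_per_step]
```

With equal scores at the first step the fused vector is the plain average of the two factors; at the second step the higher score pulls the result toward the `social` vector.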
- Research Article
- 10.24425/bpasts.2024.151960
- Oct 9, 2024
- Bulletin of the Polish Academy of Sciences Technical Sciences
Pedestrian trajectory prediction provides crucial data support for the development of smart cities. Existing pedestrian trajectory prediction methods often overlook the different types of pedestrian interactions and the micro-level spatial-temporal relationships when handling interaction information in the spatial and temporal dimensions. The proposed model employs a spatial-temporal attention-based fusion graph convolutional framework to predict future pedestrian trajectories. For the different types of local and global relationships between pedestrians, it first employs spatial-temporal attention mechanisms to capture dependencies in pedestrian sequence data, obtaining the social interactions of pedestrians in spatial contexts and the movement trends of pedestrians over time. Subsequently, a fusion graph convolutional module merges the temporal weight matrix and the spatial weight matrix into a spatial-temporal fusion feature map. Finally, a decoder section utilizes time-stacked convolutional neural networks to predict future trajectories. Validation on the ETH and UCY datasets yielded an average displacement error (ADE) of 0.34 and a final displacement error (FDE) of 0.55. The visualization results further demonstrated the rationality of the model.
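For readers unfamiliar with the two metrics, the following minimal sketch computes ADE and FDE for a single predicted trajectory using their standard definitions: the mean Euclidean distance over all predicted steps, and the distance at the last step. The function name `displacement_errors` and the sample coordinates are illustrative assumptions, not taken from the paper.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory given as lists of (x, y) points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between prediction and ground truth.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # average over all steps
    fde = distances[-1]                    # error at the final step only
    return ade, fde


# Made-up example: the prediction drifts away from the true path.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted, ground_truth)
```

Benchmark results such as those on ETH and UCY are typically averaged over many pedestrians and scenes; the sketch covers only the per-trajectory computation.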
- Research Article
- 10.1109/tim.2025.3551031
- Jan 1, 2025
- IEEE Transactions on Instrumentation and Measurement
Pedestrian trajectory prediction has broad application prospects in multiple fields. It can be integrated into various instruments and meters, such as millimeter-wave radars and LiDAR devices, as well as camera surveillance equipment in intelligent security systems. Currently, many trajectory prediction methods use overhead camera sensors to obtain scene information and historical trajectory data. However, these methods have some shortcomings. On one hand, there is the challenge of how to effectively integrate feature information and capture the complex interactions between data. On the other hand, there is the issue of how to accurately measure the difference between the model’s predicted values and the actual values and reduce cumulative errors during training. The concatenation fusion method commonly used in some approaches simply joins features together and may fail to fully consider the intrinsic correlations and differences among them. Therefore, in this article, we adopt the average fusion method to integrate static scene information from overhead cameras, historical trajectory information of pedestrians, and social interaction information. This method can reduce redundant information and allow implicit interactions among feature information within the model, thereby capturing more complex data relationships. Second, we embed the information fractal equation into the loss function of the model. This fractal loss term is designed to measure the detail loss in the model, reduce the accumulated errors during training, and enhance prediction accuracy. Finally, we use a generative adversarial network (GAN) to generate more realistic trajectories. Experiments on the ETH/UCY dataset demonstrate a significant improvement in trajectory prediction accuracy, particularly in complex situations. Compared with the baseline model, the proposed method reduces errors by 14.8% in average displacement error (ADE) and by 25.2% in final displacement error (FDE). Furthermore, the model’s training time is also effectively reduced.
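The contrast between concatenation fusion and average fusion can be sketched on plain feature vectors. This is only an illustration of the two operations; the paper applies them to learned features inside a network, and the fractal loss term is not reproduced here. The names `concat_fusion`, `average_fusion`, and the three sample vectors are hypothetical.

```python
def concat_fusion(features):
    """Join feature vectors end to end; output length is the sum of inputs."""
    fused = []
    for vector in features:
        fused.extend(vector)
    return fused


def average_fusion(features):
    """Element-wise mean of equal-length vectors; output length is unchanged."""
    dim = len(features[0])
    if any(len(vector) != dim for vector in features):
        raise ValueError("average fusion needs vectors of equal length")
    return [sum(values) / len(features) for values in zip(*features)]


# Hypothetical embeddings for the three information sources in the abstract.
scene = [1.0, 2.0]
trajectory = [3.0, 4.0]
social = [5.0, 6.0]

concatenated = concat_fusion([scene, trajectory, social])  # length 6
averaged = average_fusion([scene, trajectory, social])     # length 2
```

Averaging keeps the fused representation at the size of a single source, which is one way to read the abstract's claim about reducing redundant information.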
- Conference Article
- 10.1109/itsc55140.2022.9921990
- Oct 8, 2022
With the rapid development of communication technology, cooperative perception between vehicles and infrastructure improves the perception performance in complex scenarios. Previous studies have explored different point cloud fusion modes but ignored the cooperative analysis of perception precision and communication latency. Focusing on the problem of point cloud fusion, a latency impact analysis framework based on simulation for three typical fusion modes, former fusion, feature fusion, and postfusion, is proposed. First, the relationship between the mean average precision and distribution of translation errors of different fusion modes is described, based on which simulation trajectories are generated. The extended Kalman filter algorithm is then applied to predict and compensate for the lagged cooperative perception results. The indices lag compensation error (LCE) and equivalent latency are proposed to evaluate the final effect. Finally, numerical simulations of different point cloud fusion modes and latencies are conducted based on the TrajNet++ pedestrian trajectory dataset. The results show that the LCE is positively correlated with the latency and object speed and negatively correlated with the length of the historical trajectory and perception accuracy. Therefore, the postfusion mode with low latency and cooperative perception accuracy should be adopted for complex scenarios where the objects consistently appear suddenly and are moving fast. Conversely, the former fusion mode with high perception accuracy should be adopted. The research results provide a basis for the point cloud fusion mode selection and applicability of cooperative perception in an internet of vehicles environment.
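The effect of latency on a shared perception result can be illustrated with a deliberately simplified sketch. It replaces the extended Kalman filter described above with plain constant-velocity extrapolation, and it does not implement the paper's LCE or equivalent-latency indices. The names `compensate_lag` and `uncompensated_error` and the sample values are assumptions for illustration.

```python
import math


def compensate_lag(position, velocity, latency_s):
    """Extrapolate a delayed position forward, assuming constant velocity."""
    return tuple(p + v * latency_s for p, v in zip(position, velocity))


def uncompensated_error(velocity, latency_s):
    """Distance the object travels during the delay if nothing is corrected."""
    return math.hypot(*velocity) * latency_s


# Made-up pedestrian state received 0.2 s late.
delayed_position = (10.0, 5.0)   # metres
velocity = (1.5, 0.0)            # metres per second
latency_s = 0.2

compensated = compensate_lag(delayed_position, velocity, latency_s)
drift = uncompensated_error(velocity, latency_s)
```

Under this simplified model the uncorrected drift grows linearly with both latency and object speed, which is in line with the positive correlation the abstract reports for the LCE.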
- Book Chapter
- 10.3233/atde250256
- Jun 4, 2025
Despite advances in autonomous driving technology, accurately tracking pedestrians in complex environments remains challenging because of occlusion, dense crowds, and lighting variations. This study aims to design a highly adaptive pedestrian tracking system to improve the reliability of the autonomous driving perception system. The paper proposes a solution that combines YOLOv8-pose with an improved BoT-SORT algorithm: the detection bounding box, appearance features, and 17-point pose information are integrated through a multimodal feature fusion strategy, and the matching cost matrix is redesigned to enhance tracking performance. Experimental results show that the proposed pose feature enhancement strategy significantly improves the system’s ability to distinguish pedestrians of similar appearance and to maintain trajectory continuity, and that it adapts well to scenarios such as occlusion, dense crowds, and lighting changes. Meanwhile, the system maintains real-time processing performance and provides reliable support for the autonomous driving perception system, demonstrating the potential and value of multimodal feature fusion for pedestrian tracking in complex environments.
- Research Article
- 10.1109/tits.2025.3578023
- Jan 1, 2025
- IEEE Transactions on Intelligent Transportation Systems
Pedestrian intention and trajectory prediction are crucial for advancing intelligent transportation systems and autonomous vehicles, significantly enhancing urban mobility’s safety and efficiency. Traditional approaches have evolved from capturing pedestrian dynamics through image features and bounding box coordinates to leveraging multiple modalities and attention mechanisms. However, challenges in robust cross-modal feature integration and adaptation to complex scenarios persist. This paper introduces a dual-task approach that simultaneously predicts short-term pedestrian crossing intentions and long-term trajectories by integrating features from pedestrian regions of interest (ROIs), scene attributes, and past trajectories. For crossing intention prediction, Progressive Denoising Attention (PDA) is developed, which iteratively refines cross-modal features to augment inter-class variations. Additionally, a three-phase counterfactual training approach is employed that manipulates pedestrian ROIs and segmentation maps to further enhance model robustness in complex scenarios. For trajectory prediction, a Conditional Variational Autoencoder (CVAE) is implemented, guided by contextual embeddings from the novel Context-Aware Feature Fusion Module (CAFFM) to significantly reduce mean squared error by integrating rich spatiotemporal ROI and context information. Experimental results on benchmark datasets JAAD and PIE demonstrate the superior performance of the proposed approach in understanding and predicting pedestrian intent. The code is available at: https://github.com/neha013/DPITRA
- Conference Article
- 10.1109/robio64047.2024.10907592
- Dec 10, 2024
Accurately predicting the future trajectories of surrounding pedestrians is of great significance for the safety decision-making of intelligent vehicles. In this paper, a pedestrian trajectory prediction method based on multi-stream information fusion from an egocentric perspective is proposed. First, the raw images are preprocessed into pedestrian detection boxes and corresponding optical flow maps. Then, the detection box sequences, pedestrian region of interest (ROI) optical flow, and fixed center region optical flow are fed into three GRU-based feature extraction channels. A cross-attention mechanism is designed to closely integrate the bounding box and optical flow information. Finally, the motion information of the ego vehicle is compensated into the trajectory prediction to enhance prediction accuracy. Experiments on the PIE dataset demonstrate the effectiveness of the proposed method. Compared to current state-of-the-art methods, our approach shows an advantage in prediction accuracy.
- Research Article
- 10.1016/j.engappai.2023.107370
- Oct 30, 2023
- Engineering Applications of Artificial Intelligence
SemNav-HRO: A target-driven semantic navigation strategy with human–robot–object ternary fusion
- Research Article
- 10.3390/electronics13173460
- Aug 31, 2024
- Electronics
With the acceleration of urbanization and the growing demand for traffic safety, developing intelligent systems capable of accurately recognizing and tracking pedestrian trajectories at night or under low-light conditions has become a research focus in the field of transportation. This study aims to improve the accuracy and real-time performance of nighttime pedestrian detection and tracking. A method that integrates the multi-object detection algorithm YOLOP with the multi-object tracking algorithm DeepSORT is proposed. The improved YOLOP algorithm incorporates the C2f-faster structure in the Backbone and Neck sections, enhancing feature extraction capabilities. Additionally, a BiFormer attention mechanism is introduced to focus on the recognition of small-area features, the CARAFE module is added to improve shallow feature fusion, and the DyHead dynamic target-detection head is employed for comprehensive fusion. In terms of tracking, the ShuffleNetV2 lightweight module is integrated to reduce model parameters and network complexity. Experimental results demonstrate that the proposed FBCD-YOLOP model improves lane detection accuracy by 5.1%, increases the IoU metric by 0.8%, and enhances detection speed by 25 FPS compared to the baseline model. The accuracy of nighttime pedestrian detection reached 89.6%, representing improvements of 1.3%, 0.9%, and 3.8% over the single-task YOLO v5, multi-task TDL-YOLO, and the original YOLOP models, respectively. These enhancements significantly improve the model’s detection performance in complex nighttime environments. The enhanced DeepSORT algorithm achieved an MOTA of 86.3% and an MOTP of 84.9%, with ID switch occurrences reduced to 5. Compared to the ByteTrack and StrongSORT algorithms, MOTA improved by 2.9% and 0.4%, respectively. Additionally, network parameters were reduced by 63.6%, significantly enhancing the real-time performance of nighttime pedestrian detection and tracking and making the method highly suitable for deployment on intelligent edge computing surveillance platforms.
- Research Article
- 10.1109/tits.2024.3421373
- Oct 1, 2024
- IEEE Transactions on Intelligent Transportation Systems
Trajectory prediction is an important task in autonomous driving and monitoring systems. Most existing methods pay little attention to rapidly changing trajectory information, yet handling it effectively is crucial to ensuring pedestrian safety. The Gabor transform has inherent advantages for capturing instantaneously changing information. Therefore, for the first time, we introduce the idea of the Gabor transform into pedestrian trajectory prediction and propose the Multi-scale Learnable Gabor Transform Network (MlgtNet), which establishes global and local contextual relationships from multi-dimensional and multi-scale perspectives. The network first uses the Multi-scale Feature Dimension Enhancement Module (MFDEM) to raise the dimensionality of the trajectory sequence. It then uses the Multi-scale Gabor Convolution Module (MGCM) to guide the model in establishing dependencies over different distances and from different dimensions, modeling the interrelationship between global and local features at different scales and step sizes. Finally, the Feature Fusion Module (FFM) processes the multimodal information and fuses it with the multi-scale trajectory features to obtain the trajectory prediction representation in different visual fields. The representation results are then used for secondary fusion to obtain the global prediction results. Experimental results show that MlgtNet achieves state-of-the-art performance with its lightweight model size on the vast majority of widely used trajectory prediction datasets from different perspectives.
- Book Chapter
- 10.3233/atde250765
- Oct 1, 2025
Pedestrian trajectory prediction is one of the core technologies of autonomous driving and intelligent monitoring systems. The difficulty lies in how to accurately model the dynamic interaction between pedestrians in complex scenes and effectively integrate multi-modal environmental information. To address the shortcomings of existing methods in dynamic spatial relationship modeling and multi-modal feature fusion, this paper proposes a dynamic graph-guided multi-modal spatio-temporal hierarchical network (DGMS-Net). First, the adjacency matrix is calculated in real time through the dynamic graph convolution network to adaptively capture the spatio-temporal dependence between pedestrians. Second, a visual-trajectory multimodal fusion module is designed to enhance trajectory context awareness by using scene semantic features. Finally, a hierarchical spatio-temporal modeling framework is constructed, and the joint representation of spatial interaction and long-term temporal dependence is optimized by combining graph convolution and a Transformer encoder. Experiments on the ETH/UCY benchmark datasets show that the average displacement error and final displacement error of DGMS-Net are reduced by 11.1% and 10.3%, respectively, compared with the best existing methods. This study provides an efficient and interpretable solution for pedestrian trajectory prediction in complex urban environments.
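A per-frame adjacency matrix of the kind mentioned above can be sketched as follows. The abstract does not state how DGMS-Net computes its adjacency, so this sketch assumes a common hand-crafted choice, inverse Euclidean distance between pedestrians, recomputed at every frame. The names `distance_adjacency` and `frames` are hypothetical.

```python
import math


def distance_adjacency(positions):
    """Build an inverse-distance adjacency matrix for one frame.

    positions is a list of (x, y) points, one per pedestrian. Closer
    pedestrians get larger weights; the diagonal is left at zero.
    """
    n = len(positions)
    adjacency = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            distance = math.dist(positions[i], positions[j])
            # Guard against division by zero for coincident points.
            adjacency[i][j] = 1.0 / distance if distance > 0 else 0.0
    return adjacency


# Two made-up frames in which two pedestrians move closer together.
frames = [
    [(0.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.0), (1.0, 0.0)],
]
adjacency_per_frame = [distance_adjacency(frame) for frame in frames]
```

Because the matrix is rebuilt for each frame, the edge weight between the two pedestrians rises from 0.5 to 1.0 as they approach each other, which is the sense in which the graph is dynamic here.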