Objectives
This study develops and validates a deep-learning model that predicts the severity of pedestrian-vehicle interactions at unsignalized intersections by combining a Transformer-based model with a Multilayer Perceptron (MLP). The approach relies on feature analysis, offering a more direct and interpretable method than traditional models.

Methods
High-resolution optical cameras recorded pedestrian and vehicle movements at the study sites. Trajectories were extracted from the recordings and converted into real-world coordinates through georeferencing. Trained observers classified each interaction as a safe passage, critical event, or conflict based on movement patterns, speeds, and accelerations, and the Fleiss' kappa statistic was used to measure inter-rater agreement and confirm evaluator consistency. The proposed model pairs the time-series capabilities of a Transformer with the classification strengths of an MLP. It is trained on dynamic input variables derived from the trajectory data and uses attention mechanisms to evaluate the significance of each input variable, giving insight into the factors that influence interaction severity.

Results
The model performed well across severity categories. Safe interactions achieved a precision of 0.78, a recall of 0.91, and an F1-score of 0.84. In the more severe categories, critical events and conflicts, precision and recall were even higher. Overall accuracy was 0.87, and the macro and weighted averages of precision, recall, and F1-score were also 0.87. The variable importance analysis, based on attention scores from the proposed Transformer model, identified ‘Vehicle Speed’ as the most significant input variable, with a positive influence on severity. Conversely, ‘Approaching Angle’ and ‘Vehicle Distance from Conflict Point’ had a negative influence on severity. Other significant factors included ‘Type of Vehicle’, ‘Pedestrian Speed’, and ‘Pedestrian Yaw Rate’, highlighting the complex interplay of behavioral and environmental factors in pedestrian-vehicle interactions.

Conclusions
The proposed Transformer-MLP hybrid model effectively predicts the severity of pedestrian-vehicle interactions at crosswalks, with high precision and recall across severity categories. The key factors influencing severity identified here pave the way for further enhancements in real-time analysis and broader safety assessments in urban settings.
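
The inter-rater agreement check named in the Methods can be illustrated with a minimal sketch of the standard Fleiss' kappa formula. The function name `fleiss_kappa`, the choice of three observers, and the rating counts below are hypothetical and chosen only for illustration; they are not the study's data or code.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters per item are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # All ratings fall in a single category; kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 4 interactions, 3 observers, and the columns are
# (safe passage, critical event, conflict). These counts are made up.
example_counts = [
    [2, 1, 0],
    [1, 2, 0],
    [0, 1, 2],
    [0, 0, 3],
]
example_kappa = fleiss_kappa(example_counts)
print(f"Fleiss' kappa: {example_kappa:.3f}")
```

Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance. The sketch assumes every interaction is rated by the same number of observers.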