Examining the relationship between streetscape features and road traffic accidents is pivotal for enhancing roadway safety. Previous studies have focused primarily on how street design, sociodemographic, and land use characteristics influence crash occurrence, whereas the impact of streetscape features on pedestrian crashes has not been thoroughly investigated. Furthermore, although machine learning models predict with high accuracy and are increasingly used in traffic safety research, their predictions remain difficult to interpret. To address these gaps, this study extracts streetscape environment characteristics from street view images (SVIs) using a combination of semantic segmentation and object detection deep learning networks. These characteristics, along with a set of control variables, are then fed into the eXtreme Gradient Boosting (XGBoost) algorithm to model the occurrence of pedestrian crashes at intersections. Subsequently, the SHapley Additive exPlanations (SHAP) method is integrated with XGBoost to establish an interpretable framework for exploring the association between pedestrian crash occurrence and the surrounding streetscape built environment. The results are interpreted from global, local, and regional perspectives. From a global perspective, traffic volume and commercial land use are significant contributors to pedestrian–vehicle collisions at intersections, while road, person, and vehicle elements extracted from SVIs are associated with a higher risk of pedestrian crashes. At the local level, the XGBoost-SHAP framework quantifies each feature's contribution at individual intersections, revealing spatial heterogeneity in the factors influencing pedestrian crashes.
From a regional perspective, similar intersections can be grouped to define geographical regions, facilitating the formulation of spatially responsive strategies for reducing traffic accidents in distinct regions. This approach can potentially enhance the quality and accuracy of local policy making. These findings underscore the relationship between streetscape-level environmental characteristics and vehicle–pedestrian crashes. The integration of SVIs and deep learning techniques offers an eye-level, visually descriptive portrayal of the streetscape environment at locations where traffic crashes occur. The proposed framework not only achieves excellent prediction performance but also enhances understanding of traffic crash occurrences, offering guidance for optimizing traffic accident prevention and treatment programs.
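The local interpretation described above rests on the additive property of Shapley values: the per-feature contributions at one intersection sum to the difference between that intersection's prediction and a reference prediction. The following is a minimal sketch of that idea only, not the study's implementation. It computes exact Shapley values by enumerating feature coalitions for a toy risk function, whereas the study applies SHAP to a fitted XGBoost model. The function `toy_risk`, its coefficients, the feature names, and the all-zero baseline are hypothetical and chosen purely for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical feature order, for illustration only.
FEATURES = ["traffic_volume", "commercial_land_use", "person_share"]


def toy_risk(v):
    """Toy stand-in for a fitted model; the coefficients are assumptions."""
    traffic, commercial, person = v
    return 0.5 * traffic + 0.3 * commercial + 0.2 * traffic * person


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point.

    Features outside a coalition are set to their baseline value. This
    single-reference simplification is an assumption of the sketch.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [
                    x[j] if (j in subset or j == i) else baseline[j]
                    for j in range(n)
                ]
                without_i = [
                    x[j] if j in subset else baseline[j] for j in range(n)
                ]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


# One hypothetical intersection and an all-zero reference point.
x = [1.0, 1.0, 0.5]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_risk, x, baseline)

for name, value in zip(FEATURES, phi):
    print(f"{name}: {value:+.3f}")
```

In this toy case the interaction term between traffic volume and person share is split equally between the two features, while the purely additive commercial land use term is attributed entirely to that feature. Enumeration grows exponentially with the number of features, so it is practical only for small illustrations like this one.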