A significant reduction in transportation-related carbon dioxide (CO2) emissions is essential for sustainable urban development. CO2 concentrations from road traffic in urban areas exhibit complex spatial patterns shaped by street configurations, mobile sources, and human activities. However, a comprehensive understanding of these patterns and the interactions behind them is still lacking because the human perspective of road interface characteristics has not been taken into account. In this study, a mobile travel platform was constructed to collect both on-road navigation Street View Panoramas (OSVPs) and the corresponding CO2 concentrations. More than 100,000 matched “street view-CO2 concentration” sample pairs were obtained, covering 675.8 km of roads in Shenzhen, China. Four ensemble learning (EL) models were then used to establish nonlinear relationships between the semantic and object features of streetscapes and CO2 concentrations. After EL fusion modeling, the predictive R² on the test set exceeded 90%, and the mean absolute error (MAE) was below 3.2 ppm. The model was applied to Baidu Street View Panoramas (BSVPs) in Shenzhen to generate a map of average on-road CO2 at 100 m resolution, and the Local Indicator of Spatial Association (LISA) was then used to identify spatial clusters of high CO2 intensity. Additionally, Light Gradient Boost-SHapley Additive exPlanation (LGB-SHAP) analysis revealed that vertically planted trees can reduce CO2 emissions from on-road sources. Moreover, the factors that affect on-road CO2 exhibit interaction and threshold effects. Street View Panoramas (SVPs) and Artificial Intelligence (AI) were adopted here to improve the spatial measurement of on-road CO2 concentrations and the understanding of their driving factors. Our approach facilitates the assessment and design of low-emission transportation in urban areas, which is critical for promoting sustainable traffic development.