This research project explored the severity of road traffic accidents in the UK, combining machine learning algorithms, econometric techniques, and traditional statistical methods to analyse longitudinal historical data. Our analysis framework includes descriptive, inferential, bivariate, and multivariate methods; correlation analysis using Pearson's and Spearman's rank correlation coefficients; multiple logistic regression models; multicollinearity assessment; and model validation. To address heteroscedasticity or autocorrelation in the error terms, we applied the Generalized Method of Moments (GMM), improving the precision and reliability of our regression analyses. Additionally, we applied Vector Autoregressive (VAR) and Autoregressive Integrated Moving Average (ARIMA) models for time series forecasting. This approach achieved a Mean Absolute Scaled Error (MASE) of 0.800, indicating lower error than a naive forecast, and a Mean Error (ME) of -73.80. The project further extends its machine learning application with a random forest classifier that achieves a precision of 73%, a recall of 78%, and an F1-score of 73%. Building on this, we employed the H2O AutoML process to optimize model selection, resulting in an XGBoost model with an RMSE of 0.1761205782994506 and an MAE of 0.0874235576229789. Factor analysis was used to identify the underlying factors that explain the pattern of correlations within a set of observed variables. Scoring history, which tracks a model's performance throughout the training process, was used to monitor and improve the performance of our machine learning models. We also incorporated Explainable AI (XAI) techniques, using SHAP (SHapley Additive exPlanations) to understand the factors contributing to accident severity.
Features such as Driver_Home_Area_Type, Longitude, Driver_IMD_Decile, Road_Type, Casualty_Home_Area_Type, and Casualty_IMD_Decile were identified as significant contributors. Our research contributes to a nuanced understanding of traffic accident severity and demonstrates the potential of advanced statistical, econometric, and machine learning techniques to inform evidence-based interventions and policies for enhancing road safety.
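
For readers unfamiliar with the forecast metrics quoted above, the following is a minimal sketch of how MASE and ME are commonly computed. It is not the project's code: the function names, the one-step naive benchmark, the sign convention for ME (actual minus forecast), and the toy numbers are all assumptions made for illustration. Under this definition, a MASE below 1 means the forecast's mean absolute error is smaller than that of a naive "repeat the last value" forecast on the training series.

```python
def mase(actual, forecast, training):
    """Mean Absolute Scaled Error, scaled by a one-step naive forecast.

    Assumption: the benchmark is the in-sample naive forecast, which
    predicts each training value with the value immediately before it.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if len(training) < 2:
        raise ValueError("training series needs at least two observations")

    # Mean absolute error of the model's forecasts.
    mae_model = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

    # Mean absolute error of the naive forecast on the training series.
    naive_errors = [abs(training[i] - training[i - 1]) for i in range(1, len(training))]
    mae_naive = sum(naive_errors) / len(naive_errors)
    if mae_naive == 0:
        raise ValueError("naive forecast error is zero; MASE is undefined")

    return mae_model / mae_naive


def mean_error(actual, forecast):
    """Mean Error, assuming the convention error = actual - forecast."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    return sum(a - f for a, f in zip(actual, forecast)) / len(actual)


# Hypothetical toy series, not data from the study.
training = [10, 12, 14, 16]
actual = [18, 20]
forecast = [19, 22]

print(mase(actual, forecast, training))  # 0.75
print(mean_error(actual, forecast))      # -1.5
```

With the convention assumed here, a negative ME indicates that the forecasts were, on average, higher than the observed values.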