The rapid growth of urbanization and motorization has significantly increased traffic crashes, leading to both loss of life and diminished quality of life for crash survivors and their families. Identifying the factors that influence crash fatality is crucial for reducing such incidents. However, traffic crashes are inherently unpredictable, and crash fatality datasets are often imbalanced. This study provides a comprehensive evaluation of machine learning (ML) techniques for analyzing traffic crash fatality using an imbalanced dataset. It is the first to train eight distinct binary classification models, namely Classification and Regression Trees (CART), Random Forest (RF), Gradient Boosting Machine (GBM), Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Naïve Bayes (NB), under three strategies: in isolation, with bagging, and with bagging optimized via Grid Search CV, Random Search CV, and Bayesian Optimization. To handle data imbalance, eight resampling methods were employed: SMOTE, Random Under-sampling (RUS), Random Over-sampling (ROS), ADASYN, Tomek Links, Near Miss, SMOTETomek, and SMOTEENN. Results show that GBM, combined with Bayesian-optimized bagging and RUS, achieved the best performance, with a G-mean score of 65.23 and an F1 score of 60.06. This study offers valuable insights into effective ML techniques, data resampling methods, and advanced optimization strategies for imbalanced crash severity datasets.
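The abstract's best-performing configuration (GBM within a bagging ensemble, tuned by Bayesian optimization, with RUS applied to the training data) can be illustrated with a minimal sketch. This is not the authors' code; the dataset, search space, and parameter ranges below are illustrative assumptions using scikit-learn, imbalanced-learn, and scikit-optimize.

```python
# Sketch of the reported best pipeline: RUS -> bagged GBM -> Bayesian tuning.
# All hyperparameter ranges and the synthetic dataset are assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.under_sampling import RandomUnderSampler
from imblearn.metrics import geometric_mean_score
from skopt import BayesSearchCV
from skopt.space import Integer, Real

# Synthetic stand-in for an imbalanced crash-fatality dataset (~10% fatal).
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

# Random Under-sampling (RUS) applied to the training split only.
X_res, y_res = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# GBM base learner inside a bagging ensemble; Bayesian optimization over
# bagging-level hyperparameters (the paper's actual search space is not
# given in the abstract).
bagged_gbm = BaggingClassifier(GradientBoostingClassifier(random_state=42),
                               random_state=42)
search = BayesSearchCV(
    bagged_gbm,
    {"n_estimators": Integer(5, 30), "max_samples": Real(0.5, 1.0)},
    n_iter=20, cv=5, scoring="f1", random_state=42)
search.fit(X_res, y_res)

# Evaluate on the untouched test split with the metrics the study reports.
y_pred = search.predict(X_test)
print("G-mean:", geometric_mean_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
```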